Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: piano transcription, expressive performance rendering, disentangled representation learning
Abstract:

Expressive performance rendering (EPR) and automatic piano transcription (APT) are fundamental yet inverse tasks in music information retrieval: EPR generates expressive performances from symbolic scores, while APT recovers scores from performances. Despite their dual nature, prior work has addressed them independently. In this paper, we propose a unified framework that jointly models EPR and APT by disentangling note-level score content and global performance style representations from both paired and unpaired data. Our framework is built on a transformer-based sequence-to-sequence (Seq2Seq) architecture and is trained using only sequence-aligned data, without requiring fine-grained note-level alignment. To automate the rendering process while ensuring stylistic compatibility with the score, we introduce an independent diffusion-based performance style recommendation (PSR) module that generates style embeddings directly from score content. This modular component supports both style transfer and flexible rendering across a range of expressive styles. Experimental results from both objective and subjective evaluations demonstrate that our framework achieves competitive performance on EPR and APT tasks, while enabling effective content–style disentanglement, reliable style transfer, and stylistically appropriate rendering. Demos are available at https://jointpianist.github.io/epr-apt/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a unified transformer-based framework that jointly models expressive performance rendering and automatic piano transcription by disentangling note-level score content from global performance style. Within the taxonomy, it resides in the 'Disentangled Content-Style Representation Learning' leaf under 'Unified Frameworks for Bidirectional Score-Performance Modeling'. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The framework's dual-task formulation positions it at the intersection of transcription and synthesis, a niche that few prior systems explicitly target through shared disentangled representations.

The taxonomy reveals that neighboring leaves address related but distinct challenges. 'Multi-Task Source Separation and Synthesis' integrates pitch-timbre disentanglement for separation tasks, while 'End-to-End Performance-to-Score Transcription' focuses on transformer-based transcription without joint rendering. The 'Neural Audio Synthesis from Symbolic Input' and 'Diffusion-Based Music Synthesis' leaves emphasize synthesis quality over bidirectional modeling. The scope notes clarify that systems addressing only one direction belong elsewhere, underscoring that true joint modeling with explicit content-style separation remains underexplored. The paper's approach diverges from purely generative or transcription-only methods by enforcing bidirectional consistency through shared representations.

Among the three contributions analyzed, the unified transformer model examined six candidates and found one refutable prior work, suggesting moderate overlap in the core architecture. The diffusion-based performance style recommendation module examined ten candidates with no refutations, indicating stronger novelty in this component. The sequence-to-sequence formulation without note-level alignment examined three candidates and found one refutable example. These statistics reflect a limited search scope of nineteen total candidates, not an exhaustive survey. The style recommendation module appears most distinctive, while the joint modeling framework and sequence formulation show some precedent in the examined literature.

Based on the limited search scope, the work demonstrates moderate novelty in a sparsely populated research direction. The diffusion-based style module and the integration of bidirectional tasks through disentanglement offer fresh contributions, though the core transformer architecture shows partial overlap with prior joint modeling efforts. The analysis covers top-K semantic matches and does not claim exhaustive coverage of all relevant prior work. Future assessments would benefit from broader literature exploration, particularly in adjacent areas like multi-task learning and style transfer.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 2

Research Landscape Overview

Core task: joint modeling of expressive performance rendering and automatic piano transcription. The field encompasses several major branches that address complementary aspects of music understanding and generation. Unified frameworks for bidirectional score-performance modeling seek to bridge the gap between symbolic notation and expressive audio, often learning shared representations that support both transcription and synthesis. Automatic piano transcription systems focus on converting audio recordings into symbolic scores, tackling challenges such as polyphonic note detection and timing precision. Expressive performance rendering and synthesis methods generate realistic performances from scores by modeling timing, dynamics, and articulation. Deep learning architectures for music composition and performance explore novel neural designs for creative tasks, while datasets and computational methods provide the empirical foundation for training and evaluating these models.

Representative works such as Joint Piano Rendering[7] and Transformer Piano Expressiveness[3] illustrate how these branches intersect and inform one another. A particularly active line of work involves disentangling content from style in learned representations, enabling models to separately manipulate musical structure and performer-specific nuances. Disentangled Piano Transcription[0] sits squarely within this cluster, emphasizing the separation of note content from expressive attributes during transcription. This approach contrasts with methods like Neural Piano Synthesis[5], which prioritize high-fidelity audio generation, and Performance-MIDI to Score[4], which addresses the inverse problem of recovering clean scores from expressive performances. Nearby works such as Joint Piano Rendering[7] also explore bidirectional modeling but may differ in how they balance transcription accuracy against synthesis quality.

The central tension across these studies revolves around whether to pursue end-to-end joint optimization or modular pipelines, and how best to leverage large-scale datasets like ATEPP Dataset[10] to capture the rich variability of human performance.

Claimed Contributions

Unified transformer-based model for joint EPR and APT with disentangled representations

The authors propose a unified framework that jointly models expressive performance rendering and automatic piano transcription by learning disentangled note-level score content and global performance style representations. This joint formulation enables bidirectional modeling between symbolic and expressive forms of music using only sequence-aligned data without requiring fine-grained note-level alignment.

6 retrieved papers · Can Refute
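The content/style split claimed above can be caricatured in a few lines: per-note vectors carry score content while a single pooled vector carries performance style, and rendering recombines the two. The following NumPy sketch is purely illustrative — random matrices stand in for trained transformer layers, and all dimensions are made-up assumptions, not the paper's architecture.

```python
# Toy sketch of content/style disentanglement for rendering.
# Random weights stand in for trained encoders/decoders (assumption).
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_CONTENT, D_STYLE = 8, 16, 4

W_content = rng.standard_normal((D_IN, D_CONTENT))
W_style = rng.standard_normal((D_IN, D_STYLE))
W_decode = rng.standard_normal((D_CONTENT + D_STYLE, D_IN))

def encode(performance):
    """Split a performance (T, D_IN) into note-level content and a global style."""
    content = performance @ W_content            # (T, D_CONTENT): one vector per note
    style = performance.mean(axis=0) @ W_style   # (D_STYLE,): pooled over the sequence
    return content, style

def render(content, style):
    """Recombine note-level content with any global style vector."""
    style_tiled = np.broadcast_to(style, (content.shape[0], style.shape[0]))
    return np.concatenate([content, style_tiled], axis=1) @ W_decode

perf_a = rng.standard_normal((12, D_IN))  # two toy "performances"
perf_b = rng.standard_normal((12, D_IN))

content_a, style_a = encode(perf_a)
_, style_b = encode(perf_b)

reconstructed = render(content_a, style_a)
transferred = render(content_a, style_b)  # style transfer: same content, other style
print(transferred.shape)  # (12, 8)
```

Swapping in `style_b` while keeping `content_a` fixed is the mechanism behind the style-transfer capability the abstract mentions; in the real framework the two codes would come from learned encoders rather than random projections.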
Diffusion-based performance style recommendation module

The authors introduce an independent PSR module that generates diverse and appropriate style embeddings conditioned solely on score content. This module mimics a pianist's ability to infer suitable expressive styles from the written score and enables controllable and non-expert-driven performance rendering.

10 retrieved papers · No refutation found
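The PSR module is described as a diffusion model that samples a style embedding conditioned on score content. A minimal NumPy sketch of DDPM-style ancestral sampling is below; the denoiser, noise schedule, step count, and embedding size are all illustrative assumptions — the paper's trained conditional network is replaced by a stand-in function.

```python
# Hypothetical sketch of conditional diffusion sampling for a style embedding.
import numpy as np

rng = np.random.default_rng(42)
D_STYLE, N_STEPS = 4, 50
betas = np.linspace(1e-4, 0.05, N_STEPS)   # toy noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, score_embedding):
    # Stand-in for a trained noise predictor eps_theta(x_t, t, c):
    # here it simply pulls the sample toward the score conditioning.
    return x_t - score_embedding

def sample_style(score_embedding):
    """DDPM ancestral sampling of a style embedding given score content."""
    x = rng.standard_normal(D_STYLE)
    for t in reversed(range(N_STEPS)):
        eps = denoiser(x, t, score_embedding)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(D_STYLE) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

score_emb = np.array([0.5, -0.2, 0.1, 0.3])  # made-up score-content embedding
style = sample_style(score_emb)
print(style.shape)  # (4,)
```

Because sampling starts from fresh Gaussian noise each time, repeated calls yield different but score-conditioned style embeddings — which is what lets the module recommend diverse yet stylistically compatible renderings.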
Sequence-to-sequence formulation of EPR without note-level alignment

The authors formulate expressive performance rendering as a sequence-to-sequence task that eliminates the need for note-aligned training data and enables scalable learning using only sequence-level supervision. Despite this relaxed supervision, the model achieves competitive performance compared to alignment-dependent baselines.

3 retrieved papers · Can Refute
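The practical import of "sequence-aligned, not note-aligned" is that a training example is just a pair of token sequences: the score tokenized one way, the performance another, with no stored per-note correspondence. A toy illustration follows; the token vocabulary here is a made-up example, not the paper's actual encoding.

```python
# Sequence-level supervision: a training pair is (score tokens, performance
# tokens) with no per-note alignment. Vocabulary below is hypothetical.

def tokenize_score(notes):
    """Score notes as (MIDI pitch, quarter-note duration) -> token sequence."""
    tokens = []
    for pitch, dur in notes:
        tokens += [f"PITCH_{pitch}", f"DUR_{dur}"]
    return tokens

def tokenize_performance(notes):
    """Performed notes as (MIDI pitch, onset seconds, velocity) -> tokens."""
    tokens = []
    for pitch, onset, vel in notes:
        tokens += [f"PITCH_{pitch}", f"ONSET_{onset:.2f}", f"VEL_{vel}"]
    return tokens

score = [(60, 1.0), (64, 1.0), (67, 2.0)]                        # C-E-G as written
performance = [(60, 0.00, 72), (64, 0.52, 80), (67, 1.01, 91)]   # human timing/dynamics

# One training pair, aligned only at the sequence level:
src, tgt = tokenize_score(score), tokenize_performance(performance)
print(len(src), len(tgt))  # 6 9
```

Note the two sequences even have different lengths; a Seq2Seq model handles that natively, which is why sequence-level pairing suffices where alignment-dependent baselines would need each performed note matched to its score note.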

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Unified transformer-based model for joint EPR and APT with disentangled representations

Contribution 2: Diffusion-based performance style recommendation module

Contribution 3: Sequence-to-sequence formulation of EPR without note-level alignment

The claim summaries for each contribution are given under Claimed Contributions above.