Bridging Piano Transcription and Rendering via Disentangled Score Content and Style

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: piano transcription, expressive performance rendering, disentangled representation learning
Abstract:

Expressive performance rendering (EPR) and automatic piano transcription (APT) are fundamental yet inverse tasks in music information retrieval: EPR generates expressive performances from symbolic scores, while APT recovers scores from performances. Despite their dual nature, prior work has addressed them independently. In this paper, we propose a unified framework that jointly models EPR and APT by disentangling note-level score content and global performance style representations from both paired and unpaired data. Our framework is built on a transformer-based sequence-to-sequence (Seq2Seq) architecture and is trained using only sequence-aligned data, without requiring fine-grained note-level alignment. To automate the rendering process while ensuring stylistic compatibility with the score, we introduce an independent diffusion-based performance style recommendation (PSR) module that generates style embeddings directly from score content. This modular component supports both style transfer and flexible rendering across a range of expressive styles. Experimental results from both objective and subjective evaluations demonstrate that our framework achieves competitive performance on EPR and APT tasks, while enabling effective content–style disentanglement, reliable style transfer, and stylistically appropriate rendering. Demos are available at https://jointpianist.github.io/epr-apt/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a unified transformer-based framework that jointly models expressive performance rendering and automatic piano transcription by disentangling note-level score content from global performance style. Within the taxonomy, it resides in the 'Disentangled Content-Style Representation Learning' leaf under 'Unified Frameworks for Bidirectional Score-Performance Modeling'. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The framework's dual-task formulation positions it at the intersection of transcription and synthesis, a niche that few prior systems explicitly target through shared disentangled representations.

The taxonomy reveals that neighboring leaves address related but distinct challenges. 'Multi-Task Source Separation and Synthesis' integrates pitch-timbre disentanglement for separation tasks, while 'End-to-End Performance-to-Score Transcription' focuses on transformer-based transcription without joint rendering. The 'Neural Audio Synthesis from Symbolic Input' and 'Diffusion-Based Music Synthesis' leaves emphasize synthesis quality over bidirectional modeling. The scope notes clarify that systems addressing only one direction belong elsewhere, underscoring that true joint modeling with explicit content-style separation remains underexplored. The paper's approach diverges from purely generative or transcription-only methods by enforcing bidirectional consistency through shared representations.

Among the three contributions analyzed, the unified transformer model examined six candidates and found one refutable prior work, suggesting moderate overlap in the core architecture. The diffusion-based performance style recommendation module examined ten candidates with no refutations, indicating stronger novelty in this component. The sequence-to-sequence formulation without note-level alignment examined three candidates and found one refutable example. These statistics reflect a limited search scope of nineteen total candidates, not an exhaustive survey. The style recommendation module appears most distinctive, while the joint modeling framework and sequence formulation show some precedent in the examined literature.

Based on the limited search scope, the work demonstrates moderate novelty in a sparsely populated research direction. The diffusion-based style module and the integration of bidirectional tasks through disentanglement offer fresh contributions, though the core transformer architecture shows partial overlap with prior joint modeling efforts. The analysis covers top-K semantic matches and does not claim exhaustive coverage of all relevant prior work. Future assessments would benefit from broader literature exploration, particularly in adjacent areas like multi-task learning and style transfer.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 2

Research Landscape Overview

Core task: joint modeling of expressive performance rendering and automatic piano transcription. The field encompasses several major branches that address complementary aspects of music understanding and generation. Unified frameworks for bidirectional score-performance modeling seek to bridge the gap between symbolic notation and expressive audio, often learning shared representations that support both transcription and synthesis. Automatic piano transcription systems focus on converting audio recordings into symbolic scores, tackling challenges such as polyphonic note detection and timing precision. Expressive performance rendering and synthesis methods generate realistic performances from scores by modeling timing, dynamics, and articulation. Deep learning architectures for music composition and performance explore novel neural designs for creative tasks, while datasets and computational methods provide the empirical foundation for training and evaluating these models.

Representative works such as Joint Piano Rendering[7] and Transformer Piano Expressiveness[3] illustrate how these branches intersect and inform one another. A particularly active line of work involves disentangling content from style in learned representations, enabling models to separately manipulate musical structure and performer-specific nuances. Disentangled Piano Transcription[0] sits squarely within this cluster, emphasizing the separation of note content from expressive attributes during transcription. This approach contrasts with methods like Neural Piano Synthesis[5], which prioritize high-fidelity audio generation, and Performance-MIDI to Score[4], which addresses the inverse problem of recovering clean scores from expressive performances. Nearby works such as Joint Piano Rendering[7] also explore bidirectional modeling but may differ in how they balance transcription accuracy against synthesis quality.

The central tension across these studies revolves around whether to pursue end-to-end joint optimization or modular pipelines, and how best to leverage large-scale datasets like ATEPP Dataset[10] to capture the rich variability of human performance.

Claimed Contributions

Unified transformer-based model for joint EPR and APT with disentangled representations

The authors propose a unified framework that jointly models expressive performance rendering and automatic piano transcription by learning disentangled note-level score content and global performance style representations. This joint formulation enables bidirectional modeling between symbolic and expressive forms of music using only sequence-aligned data without requiring fine-grained note-level alignment.

6 retrieved papers · Can Refute
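The content/style split claimed above can be caricatured in a few lines: per-note vectors carry score content while a single pooled vector carries performance style, and rendering recombines the two. The following NumPy sketch is purely illustrative — random matrices stand in for trained transformer layers, and all dimensions are made-up assumptions, not the paper's architecture.

```python
# Toy sketch of content/style disentanglement for rendering.
# Random weights stand in for trained encoders/decoders (assumption).
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_CONTENT, D_STYLE = 8, 16, 4

W_content = rng.standard_normal((D_IN, D_CONTENT))
W_style = rng.standard_normal((D_IN, D_STYLE))
W_decode = rng.standard_normal((D_CONTENT + D_STYLE, D_IN))

def encode(performance):
    """Split a performance (T, D_IN) into note-level content and a global style."""
    content = performance @ W_content            # (T, D_CONTENT): one vector per note
    style = performance.mean(axis=0) @ W_style   # (D_STYLE,): pooled over the sequence
    return content, style

def render(content, style):
    """Recombine note-level content with any global style vector."""
    style_tiled = np.broadcast_to(style, (content.shape[0], style.shape[0]))
    return np.concatenate([content, style_tiled], axis=1) @ W_decode

perf_a = rng.standard_normal((12, D_IN))  # two toy "performances"
perf_b = rng.standard_normal((12, D_IN))

content_a, style_a = encode(perf_a)
_, style_b = encode(perf_b)

reconstructed = render(content_a, style_a)
transferred = render(content_a, style_b)  # style transfer: same content, other style
print(transferred.shape)  # (12, 8)
```

Swapping in `style_b` while keeping `content_a` fixed is the mechanism behind the style-transfer capability the abstract mentions; in the real framework the two codes would come from learned encoders rather than random projections.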
Diffusion-based performance style recommendation module

The authors introduce an independent PSR module that generates diverse and appropriate style embeddings conditioned solely on score content. This module mimics a pianist's ability to infer suitable expressive styles from the written score and enables controllable and non-expert-driven performance rendering.

10 retrieved papers · No refutation found
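The PSR module is described as a diffusion model that samples a style embedding conditioned on score content. A minimal NumPy sketch of DDPM-style ancestral sampling is below; the denoiser, noise schedule, step count, and embedding size are all illustrative assumptions — the paper's trained conditional network is replaced by a stand-in function.

```python
# Hypothetical sketch of conditional diffusion sampling for a style embedding.
import numpy as np

rng = np.random.default_rng(42)
D_STYLE, N_STEPS = 4, 50
betas = np.linspace(1e-4, 0.05, N_STEPS)   # toy noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, score_embedding):
    # Stand-in for a trained noise predictor eps_theta(x_t, t, c):
    # here it simply pulls the sample toward the score conditioning.
    return x_t - score_embedding

def sample_style(score_embedding):
    """DDPM ancestral sampling of a style embedding given score content."""
    x = rng.standard_normal(D_STYLE)
    for t in reversed(range(N_STEPS)):
        eps = denoiser(x, t, score_embedding)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(D_STYLE) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

score_emb = np.array([0.5, -0.2, 0.1, 0.3])  # made-up score-content embedding
style = sample_style(score_emb)
print(style.shape)  # (4,)
```

Because sampling starts from fresh Gaussian noise each time, repeated calls yield different but score-conditioned style embeddings — which is what lets the module recommend diverse yet stylistically compatible renderings.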
Sequence-to-sequence formulation of EPR without note-level alignment

The authors formulate expressive performance rendering as a sequence-to-sequence task that eliminates the need for note-aligned training data and enables scalable learning using only sequence-level supervision. Despite this relaxed supervision, the model achieves competitive performance compared to alignment-dependent baselines.

3 retrieved papers · Can Refute
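The practical import of "sequence-aligned, not note-aligned" is that a training example is just a pair of token sequences: the score tokenized one way, the performance another, with no stored per-note correspondence. A toy illustration follows; the token vocabulary here is a made-up example, not the paper's actual encoding.

```python
# Sequence-level supervision: a training pair is (score tokens, performance
# tokens) with no per-note alignment. Vocabulary below is hypothetical.

def tokenize_score(notes):
    """Score notes as (MIDI pitch, quarter-note duration) -> token sequence."""
    tokens = []
    for pitch, dur in notes:
        tokens += [f"PITCH_{pitch}", f"DUR_{dur}"]
    return tokens

def tokenize_performance(notes):
    """Performed notes as (MIDI pitch, onset seconds, velocity) -> tokens."""
    tokens = []
    for pitch, onset, vel in notes:
        tokens += [f"PITCH_{pitch}", f"ONSET_{onset:.2f}", f"VEL_{vel}"]
    return tokens

score = [(60, 1.0), (64, 1.0), (67, 2.0)]                        # C-E-G as written
performance = [(60, 0.00, 72), (64, 0.52, 80), (67, 1.01, 91)]   # human timing/dynamics

# One training pair, aligned only at the sequence level:
src, tgt = tokenize_score(score), tokenize_performance(performance)
print(len(src), len(tgt))  # 6 9
```

Note the two sequences even have different lengths; a Seq2Seq model handles that natively, which is why sequence-level pairing suffices where alignment-dependent baselines would need each performed note matched to its score note.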

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Unified transformer-based model for joint EPR and APT with disentangled representations

Contribution 2: Diffusion-based performance style recommendation module

Contribution 3: Sequence-to-sequence formulation of EPR without note-level alignment

The claim summaries for each contribution are given under Claimed Contributions above.