Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

ICLR 2026 Conference Submission
Anonymous Authors
Real-time Music Accompaniment, Music Generation, Reinforcement Learning, Adversarial Machine Learning
Abstract:

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed, and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a generative adversarial post-training (GAPT) method for melody-to-chord accompaniment in live jamming scenarios. It occupies the sole position in the 'GAN-Based Policy Regularization for Interactive Music' leaf, which itself is the only leaf under 'Adversarial Methods for Reward Hacking Mitigation'. With no sibling papers in its immediate taxonomy node and only two other papers across the entire taxonomy tree, the work sits in a notably sparse research direction within the broader landscape of RL-based music generation.

The taxonomy reveals three main branches: adversarial methods, reward engineering, and audio mixing control. The paper's adversarial approach contrasts with neighboring work in 'Modified Reward Functions for Composition Quality', which addresses reward exploitation through careful signal design rather than discriminator-based regularization. The 'Adaptive DJ Mixing Strategy Learning' branch tackles audio control tasks distinct from symbolic composition. The scope notes clarify that the paper's focus on real-time interactive systems with discriminator regularization differentiates it from offline composition methods and pure reward-shaping strategies.

Among the twenty-four candidate papers examined in total, the GAPT method itself was not clearly refuted by any of the ten candidates reviewed for it. However, the two-phase adaptive discriminator schedule encountered one refutable candidate among the four examined, and the real-time interactive system evaluation found two refutable candidates among the ten reviewed. These statistics suggest that while the core adversarial training framework may offer novelty, specific implementation choices around discriminator updates and interactive evaluation protocols have more substantial prior work within the limited search scope.

Based on the top-24 semantic matches examined, the work appears to explore a relatively under-populated intersection of adversarial training and live music interaction. The sparse taxonomy structure and limited sibling papers suggest this specific combination has received less attention than adjacent areas. However, the analysis does not cover exhaustive citation networks or domain-specific music generation venues, leaving open questions about related work in specialized music technology communities.

Taxonomy

Core-task Taxonomy Papers: 2
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: Mitigating reward hacking in reinforcement learning post-training for music generation. The field structure suggested by the taxonomy reveals three main branches addressing distinct facets of applying RL to music. Adversarial Methods for Reward Hacking Mitigation explores techniques that use adversarial or discriminative signals to prevent policies from exploiting reward model weaknesses, often drawing on GAN-like frameworks to regularize learned behaviors in interactive or generative music settings. Reward Engineering and Shaping Approaches focuses on designing or refining reward functions themselves—through careful feature engineering, auxiliary objectives, or curriculum strategies—to guide RL agents toward musically meaningful outcomes without falling into degenerate solutions. RL for Audio Mixing and Control Tasks examines the application of RL to lower-level audio processing problems, such as dynamic range compression or equalization, where the challenge is to learn control policies that respect perceptual quality constraints.

Together, these branches illustrate how the community tackles reward misspecification from complementary angles: adversarial regularization, principled reward design, and domain-specific control. A particularly active line of work within adversarial methods investigates GAN-based policy regularization for interactive music, where discriminators help distinguish genuine musical structure from artifacts that merely maximize a flawed reward signal. Adversarial Post-Training Music[0] sits squarely in this cluster, emphasizing post-training adversarial refinement to curb reward hacking in generative models. This contrasts with earlier efforts like Music RNN Reinforcement[2], which applied RL to sequence generation but relied more heavily on hand-crafted reward shaping rather than adversarial oversight.

Meanwhile, Mixing Music DRL[1] operates in a different part of the landscape—focusing on audio mixing control tasks—highlighting how RL techniques for music span both high-level creative generation and low-level signal processing. The central open question across these branches remains how to balance exploration, perceptual fidelity, and computational cost when reward signals are inherently incomplete or subjective.

Claimed Contributions

Generative Adversarial Post-Training (GAPT) method

The authors introduce GAPT, a novel training method that combines RL post-training with adversarial learning. A discriminator is trained to distinguish policy-generated trajectories from real data, providing an adversarial reward signal that prevents the policy from collapsing to trivial, repetitive outputs while maintaining harmonic coherence.
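Based on the description above, the reward combination can be sketched as a coherence term plus a GAN-style adversarial term derived from the discriminator's output. This is a minimal illustration, not the authors' implementation: the names `coherence_reward`, `discriminator`, and the mixing weight `lambda_adv` are assumptions, and the toy stand-ins below exist only to make the sketch runnable.

```python
import numpy as np

def gapt_reward(trajectory, coherence_reward, discriminator, lambda_adv=0.5):
    """Combine a coherence-based reward with an adversarial term.

    The discriminator outputs the probability that a trajectory was
    drawn from the real data distribution; the policy is thus rewarded
    for trajectories the discriminator cannot tell apart from real data,
    which discourages collapse to trivial, repetitive outputs.
    """
    r_coh = coherence_reward(trajectory)
    d = discriminator(trajectory)      # assumed to lie in (0, 1)
    r_adv = np.log(d + 1e-8)           # GAN-style log reward; epsilon avoids log(0)
    return r_coh + lambda_adv * r_adv

# Toy stand-ins for illustration only.
coherence = lambda traj: len(set(traj)) / len(traj)  # crude anti-repetition score
disc = lambda traj: 0.7                              # pretend discriminator score
print(round(gapt_reward([60, 64, 67, 60], coherence, disc), 3))  # → 0.572
```

In a real pipeline both terms would be computed per trajectory during RL post-training, with the discriminator co-trained on policy rollouts versus real data as the paper describes.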

10 retrieved papers
Two-phase adaptive discriminator update schedule

The authors develop a two-phase training schedule for stable adversarial learning. Phase one warms up the discriminator with fixed-interval updates, while phase two applies adaptive updates gated by a moving average threshold, addressing nonstationarity and preventing the discriminator from overpowering the policy.
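The schedule described above can be sketched as a small gating class. This is a hedged reconstruction from the prose summary alone: the class name, the accuracy-based gating signal, and all defaults (`warmup_steps`, `warmup_interval`, `window`, `threshold`) are illustrative assumptions rather than the authors' reported hyperparameters.

```python
from collections import deque

class DiscriminatorSchedule:
    """Two-phase update gate for adversarial training stability.

    Phase one (warm-up): update the discriminator at a fixed interval.
    Phase two (adaptive): update only when the discriminator's moving-average
    accuracy drops below a threshold, so it tracks the nonstationary policy
    without overpowering it.
    """

    def __init__(self, warmup_steps=1000, warmup_interval=10,
                 window=100, threshold=0.75):
        self.warmup_steps = warmup_steps
        self.warmup_interval = warmup_interval
        self.acc_window = deque(maxlen=window)  # recent discriminator accuracies
        self.threshold = threshold
        self.step = 0

    def record_accuracy(self, acc):
        """Log the discriminator's accuracy on a recent batch."""
        self.acc_window.append(acc)

    def should_update(self):
        """Return True when the discriminator should take a gradient step."""
        self.step += 1
        if self.step <= self.warmup_steps:                 # phase one: fixed interval
            return self.step % self.warmup_interval == 0
        if not self.acc_window:                            # no accuracy signal yet
            return True
        moving_avg = sum(self.acc_window) / len(self.acc_window)
        return moving_avg < self.threshold                 # phase two: gated update
```

A training loop would call `should_update()` once per policy step and `record_accuracy()` after each discriminator evaluation; the moving-average gate then throttles updates whenever the discriminator is already winning.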

4 retrieved papers
Can Refute
Real-time interactive music accompaniment system and evaluation

The authors apply their method to real-time melody-to-chord accompaniment and evaluate it through simulation with fixed melodies and learned agents, plus a user study with expert musicians using a deployed interactive system. They demonstrate improvements in diversity, harmonic coherence, adaptation speed, and perceived user agency.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Generative Adversarial Post-Training (GAPT) method

The authors introduce GAPT, a novel training method that combines RL post-training with adversarial learning. A discriminator is trained to distinguish policy-generated trajectories from real data, providing an adversarial reward signal that prevents the policy from collapsing to trivial, repetitive outputs while maintaining harmonic coherence.

Contribution

Two-phase adaptive discriminator update schedule

The authors develop a two-phase training schedule for stable adversarial learning. Phase one warms up the discriminator with fixed-interval updates, while phase two applies adaptive updates gated by a moving average threshold, addressing nonstationarity and preventing the discriminator from overpowering the policy.

Contribution

Real-time interactive music accompaniment system and evaluation

The authors apply their method to real-time melody-to-chord accompaniment and evaluate it through simulation with fixed melodies and learned agents, plus a user study with expert musicians using a deployed interactive system. They demonstrate improvements in diversity, harmonic coherence, adaptation speed, and perceived user agency.