Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

ICLR 2026 Conference Submission
Anonymous Authors
Real-time Music Accompaniment, Music Generation, Reinforcement Learning, Adversarial Machine Learning
Abstract:

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed, and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a generative adversarial post-training (GAPT) method for melody-to-chord accompaniment in live jamming scenarios. It occupies the sole position in the 'GAN-Based Policy Regularization for Interactive Music' leaf, which itself is the only leaf under 'Adversarial Methods for Reward Hacking Mitigation'. With no sibling papers in its immediate taxonomy node and only two other papers across the entire taxonomy tree, the work sits in a notably sparse research direction within the broader landscape of RL-based music generation.

The taxonomy reveals three main branches: adversarial methods, reward engineering, and audio mixing control. The paper's adversarial approach contrasts with neighboring work in 'Modified Reward Functions for Composition Quality', which addresses reward exploitation through careful signal design rather than discriminator-based regularization. The 'Adaptive DJ Mixing Strategy Learning' branch tackles audio control tasks distinct from symbolic composition. The scope notes clarify that the paper's focus on real-time interactive systems with discriminator regularization differentiates it from offline composition methods and pure reward-shaping strategies.

Among the twenty-four candidate papers examined in total, the GAPT method itself was not clearly refuted by any of the ten candidates reviewed for it. However, the two-phase adaptive discriminator schedule encountered one refutable candidate among the four examined, and the real-time interactive system evaluation found two refutable candidates among the ten reviewed. These statistics suggest that while the core adversarial training framework may offer novelty, specific implementation choices around discriminator updates and interactive evaluation protocols have more substantial prior work within the limited search scope.

Based on the top-24 semantic matches examined, the work appears to explore a relatively under-populated intersection of adversarial training and live music interaction. The sparse taxonomy structure and limited sibling papers suggest this specific combination has received less attention than adjacent areas. However, the analysis does not cover exhaustive citation networks or domain-specific music generation venues, leaving open questions about related work in specialized music technology communities.

Taxonomy

Core-task Taxonomy Papers: 2
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: Mitigating reward hacking in reinforcement learning post-training for music generation. The field structure suggested by the taxonomy reveals three main branches addressing distinct facets of applying RL to music. Adversarial Methods for Reward Hacking Mitigation explores techniques that use adversarial or discriminative signals to prevent policies from exploiting reward model weaknesses, often drawing on GAN-like frameworks to regularize learned behaviors in interactive or generative music settings. Reward Engineering and Shaping Approaches focuses on designing or refining reward functions themselves—through careful feature engineering, auxiliary objectives, or curriculum strategies—to guide RL agents toward musically meaningful outcomes without falling into degenerate solutions. RL for Audio Mixing and Control Tasks examines the application of RL to lower-level audio processing problems, such as dynamic range compression or equalization, where the challenge is to learn control policies that respect perceptual quality constraints.

Together, these branches illustrate how the community tackles reward misspecification from complementary angles: adversarial regularization, principled reward design, and domain-specific control. A particularly active line of work within adversarial methods investigates GAN-based policy regularization for interactive music, where discriminators help distinguish genuine musical structure from artifacts that merely maximize a flawed reward signal. Adversarial Post-Training Music[0] sits squarely in this cluster, emphasizing post-training adversarial refinement to curb reward hacking in generative models. This contrasts with earlier efforts like Music RNN Reinforcement[2], which applied RL to sequence generation but relied more heavily on hand-crafted reward shaping rather than adversarial oversight.

Meanwhile, Mixing Music DRL[1] operates in a different part of the landscape—focusing on audio mixing control tasks—highlighting how RL techniques for music span both high-level creative generation and low-level signal processing. The central open question across these branches remains how to balance exploration, perceptual fidelity, and computational cost when reward signals are inherently incomplete or subjective.

Claimed Contributions

Generative Adversarial Post-Training (GAPT) method

The authors introduce GAPT, a novel training method that combines RL post-training with adversarial learning. A discriminator is trained to distinguish policy-generated trajectories from real data, providing an adversarial reward signal that prevents the policy from collapsing to trivial, repetitive outputs while maintaining harmonic coherence.
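Based on the description above, the reward combination can be sketched as a coherence term plus a GAN-style adversarial term derived from the discriminator's output. This is a minimal illustration, not the authors' implementation: the names `coherence_reward`, `discriminator`, and the mixing weight `lambda_adv` are assumptions, and the toy stand-ins below exist only to make the sketch runnable.

```python
import numpy as np

def gapt_reward(trajectory, coherence_reward, discriminator, lambda_adv=0.5):
    """Combine a coherence-based reward with an adversarial term.

    The discriminator outputs the probability that a trajectory was
    drawn from the real data distribution; the policy is thus rewarded
    for trajectories the discriminator cannot tell apart from real data,
    which discourages collapse to trivial, repetitive outputs.
    """
    r_coh = coherence_reward(trajectory)
    d = discriminator(trajectory)      # assumed to lie in (0, 1)
    r_adv = np.log(d + 1e-8)           # GAN-style log reward; epsilon avoids log(0)
    return r_coh + lambda_adv * r_adv

# Toy stand-ins for illustration only.
coherence = lambda traj: len(set(traj)) / len(traj)  # crude anti-repetition score
disc = lambda traj: 0.7                              # pretend discriminator score
print(round(gapt_reward([60, 64, 67, 60], coherence, disc), 3))  # → 0.572
```

In a real pipeline both terms would be computed per trajectory during RL post-training, with the discriminator co-trained on policy rollouts versus real data as the paper describes.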

10 retrieved papers
Two-phase adaptive discriminator update schedule

The authors develop a two-phase training schedule for stable adversarial learning. Phase one warms up the discriminator with fixed-interval updates, while phase two applies adaptive updates gated by a moving average threshold, addressing nonstationarity and preventing the discriminator from overpowering the policy.
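The schedule described above can be sketched as a small gating class. This is a hedged reconstruction from the prose summary alone: the class name, the accuracy-based gating signal, and all defaults (`warmup_steps`, `warmup_interval`, `window`, `threshold`) are illustrative assumptions rather than the authors' reported hyperparameters.

```python
from collections import deque

class DiscriminatorSchedule:
    """Two-phase update gate for adversarial training stability.

    Phase one (warm-up): update the discriminator at a fixed interval.
    Phase two (adaptive): update only when the discriminator's moving-average
    accuracy drops below a threshold, so it tracks the nonstationary policy
    without overpowering it.
    """

    def __init__(self, warmup_steps=1000, warmup_interval=10,
                 window=100, threshold=0.75):
        self.warmup_steps = warmup_steps
        self.warmup_interval = warmup_interval
        self.acc_window = deque(maxlen=window)  # recent discriminator accuracies
        self.threshold = threshold
        self.step = 0

    def record_accuracy(self, acc):
        """Log the discriminator's accuracy on a recent batch."""
        self.acc_window.append(acc)

    def should_update(self):
        """Return True when the discriminator should take a gradient step."""
        self.step += 1
        if self.step <= self.warmup_steps:                 # phase one: fixed interval
            return self.step % self.warmup_interval == 0
        if not self.acc_window:                            # no accuracy signal yet
            return True
        moving_avg = sum(self.acc_window) / len(self.acc_window)
        return moving_avg < self.threshold                 # phase two: gated update
```

A training loop would call `should_update()` once per policy step and `record_accuracy()` after each discriminator evaluation; the moving-average gate then throttles updates whenever the discriminator is already winning.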

4 retrieved papers
Can Refute
Real-time interactive music accompaniment system and evaluation

The authors apply their method to real-time melody-to-chord accompaniment and evaluate it through simulation with fixed melodies and learned agents, plus a user study with expert musicians using a deployed interactive system. They demonstrate improvements in diversity, harmonic coherence, adaptation speed, and perceived user agency.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Generative Adversarial Post-Training (GAPT) method

The authors introduce GAPT, a novel training method that combines RL post-training with adversarial learning. A discriminator is trained to distinguish policy-generated trajectories from real data, providing an adversarial reward signal that prevents the policy from collapsing to trivial, repetitive outputs while maintaining harmonic coherence.

Contribution

Two-phase adaptive discriminator update schedule

The authors develop a two-phase training schedule for stable adversarial learning. Phase one warms up the discriminator with fixed-interval updates, while phase two applies adaptive updates gated by a moving average threshold, addressing nonstationarity and preventing the discriminator from overpowering the policy.

Contribution

Real-time interactive music accompaniment system and evaluation

The authors apply their method to real-time melody-to-chord accompaniment and evaluate it through simulation with fixed melodies and learned agents, plus a user study with expert musicians using a deployed interactive system. They demonstrate improvements in diversity, harmonic coherence, adaptation speed, and perceived user agency.