Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Distributed Reinforcement Learning, Agent Ensemble Learning, Agent Diversity, Exploration Efficiency
Abstract:

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can degrade the quality of collected samples or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization (CPO), which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple dexterous manipulation tasks in both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Coupled Policy Optimization (CPO), which regulates diversity in ensemble policy gradient methods through KL constraints between policies, and provides theoretical analysis of how inter-policy diversity affects learning efficiency. Within the taxonomy, it resides in the 'Explicit Diversity Control in Multi-Agent Systems' leaf alongside one sibling paper ('Controlling Behavioral Diversity'). This leaf is part of the broader 'Policy Diversity Regulation and Control Mechanisms' branch, which contains only two leaves and five total papers. The placement suggests the paper addresses a relatively focused research direction within a moderately sparse area of the field.

The taxonomy reveals that neighboring work splits into two main directions: explicit diversity control (where this paper sits) versus implicit diversity enhancement through intrinsic motivation or variational methods. The 'Diversity Measurement and Analysis' sibling leaf contains three papers examining diversity-performance relationships without control mechanisms, while the parallel 'Exploration and Behavioral Diversity Enhancement' branch explores intrinsic rewards and parallel policy strategies. The scope notes clarify that CPO's explicit KL-based regulation distinguishes it from methods that promote diversity only implicitly, positioning it at the intersection of diversity control and scalable parallel training.

Among the eleven candidates examined through limited semantic search, none clearly refute any of the three identified contributions. The theoretical analysis of diversity impact examined ten candidates with zero refutations, while the CPO method itself examined one candidate with no overlap found. The empirical demonstration of structured policy formation examined zero candidates. This limited search scope—covering roughly half the taxonomy's twenty-two papers—suggests the analysis captures closely related work but may not reflect the full landscape. The absence of refutations among examined candidates indicates potential novelty within the search scope, though the small sample size limits definitive conclusions.

Based on the top-eleven semantic matches examined, the work appears to occupy a distinct position within explicit diversity control methods, particularly in its focus on massively parallel environments and KL-based regulation. However, the limited search scope and sparse taxonomy leaf (only two papers) make it difficult to assess whether similar theoretical frameworks or KL-constraint approaches exist in adjacent areas not captured by this analysis. The contribution-level statistics suggest novelty within the examined subset, but broader claims would require more comprehensive literature coverage.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0

Research Landscape Overview

Core task: Regulating policy diversity in ensemble reinforcement learning under massively parallel environments.

The field structure reflects a multifaceted approach to managing diverse agent behaviors at scale. The taxonomy organizes work into five main branches: Policy Diversity Regulation and Control Mechanisms, which focuses on explicit methods to shape and constrain behavioral variation among agents; Exploration and Behavioral Diversity Enhancement, which emphasizes techniques that encourage varied exploration strategies and emergent behaviors; Parallel and Distributed Multi-Agent Training Frameworks, which addresses the computational infrastructure needed to train large ensembles efficiently, such as the Malib Parallel Framework[9]; Domain-Specific Multi-Agent Applications, which targets concrete problem settings like traffic simulation (TrafficSim[8]) and autonomous driving (Autonomous Vehicle Behaviors[5]); and Behavioral Control and Robustness in Multi-Agent Systems, which examines stability, coordination, and resilience when diverse policies interact. Together, these branches capture the tension between fostering diversity for exploration and maintaining coherent, controllable ensemble behavior.

Several active lines of work highlight key trade-offs and open questions. One strand investigates how to measure and regulate diversity explicitly, balancing the benefits of varied policies against the risk of chaotic or redundant behaviors; Controlling Behavioral Diversity[1] and Behavioral Diversity Impact[2] exemplify this concern. Another strand explores intrinsic motivation and role differentiation (MAVEN[12], Role Diversity Diagnosis[16]) to encourage agents to specialize without manual intervention.

Rethinking Policy Diversity[0] sits within the explicit diversity control cluster, closely aligned with Controlling Behavioral Diversity[1], yet it emphasizes scalable regulation mechanisms suited to massively parallel settings. Compared to Behavioral Diversity Impact[2], which analyzes diversity's effects on performance, Rethinking Policy Diversity[0] focuses more on the control levers that practitioners can adjust to maintain desired diversity levels under high parallelism, addressing both computational efficiency and behavioral coherence in large-scale ensembles.

Claimed Contributions

Theoretical analysis of policy diversity impact on ensemble policy gradient methods

The authors theoretically demonstrate that excessive divergence between leader and follower policies reduces effective sample size and increases gradient estimation bias in ensemble policy gradient methods, thereby harming both training stability and sample efficiency in large-scale reinforcement learning.

10 retrieved papers
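The effective-sample-size argument behind this contribution can be made concrete with a small numerical sketch (illustrative only; the function names and the choice of Gaussian policies are assumptions, not the paper's code). Samples drawn from a follower policy are reweighted under the leader via importance weights, and the standard ESS estimate, (Σw)² / Σw², collapses as the two policies diverge:

```python
import math
import random

def ess(weights):
    """Effective sample size of importance weights: (sum w)^2 / sum w^2."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def ess_fraction(mean_gap, n=20000, seed=0):
    """Draw actions from a follower N(mean_gap, 1), weight them under a
    leader N(0, 1), and return the ESS as a fraction of n."""
    rng = random.Random(seed)
    xs = [rng.gauss(mean_gap, 1.0) for _ in range(n)]
    ws = [math.exp(gaussian_logpdf(x, 0.0, 1.0) - gaussian_logpdf(x, mean_gap, 1.0))
          for x in xs]
    return ess(ws) / n

# As the follower drifts from the leader, the usable fraction of its
# samples (under the leader's gradient estimator) shrinks rapidly.
for gap in (0.0, 0.5, 1.0, 2.0):
    print(f"mean gap {gap:.1f}: ESS fraction = {ess_fraction(gap):.3f}")
```

For unit-variance Gaussians the expected ESS fraction is roughly exp(-(mean gap)²), so even a modest gap leaves the leader's gradient estimator with only a small fraction of usable follower samples, which is the mechanism the claimed analysis ties to gradient bias and instability.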
Coupled Policy Optimization (CPO) method

The authors introduce CPO, a novel method that regulates policy diversity in ensemble reinforcement learning by constraining KL divergence between follower and leader policies during updates, combined with an adversarial reward mechanism to prevent policy overconcentration while maintaining exploration diversity.

1 retrieved paper
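The described mechanism, a KL constraint pulling each follower toward the leader plus an adversarial reward pushing followers apart, can be sketched as a toy scalar objective (all names and the exact functional form are hypothetical; the paper's actual losses may differ). For 1-D Gaussian policies the KL term has a closed form:

```python
import math

def gaussian_kl(mu_f, sig_f, mu_l, sig_l):
    """Closed-form KL( N(mu_f, sig_f^2) || N(mu_l, sig_l^2) )."""
    return (math.log(sig_l / sig_f)
            + (sig_f ** 2 + (mu_f - mu_l) ** 2) / (2 * sig_l ** 2)
            - 0.5)

def follower_objective(surrogate, follower, leader, others,
                       beta=1.0, alpha=0.1):
    """Toy CPO-style follower objective (hypothetical form):
    maximize the policy-gradient surrogate, pay a KL penalty toward the
    leader (caps divergence), and earn a small repulsion bonus against
    the other followers (prevents overconcentration)."""
    mu_f, sig_f = follower
    kl_leader = gaussian_kl(mu_f, sig_f, *leader)
    repulsion = sum(gaussian_kl(mu_f, sig_f, mu_o, sig_o)
                    for mu_o, sig_o in others)
    return surrogate - beta * kl_leader + alpha * repulsion

leader = (0.0, 1.0)
near = (0.2, 1.0)   # follower that stays close to the leader
far = (3.0, 1.0)    # follower that has drifted away
print(follower_objective(1.0, near, leader, others=[far]))
print(follower_objective(1.0, far, leader, others=[near]))
```

With beta dominating alpha, the near follower scores higher than the drifted one: the KL penalty caps divergence from the leader while the repulsion term keeps followers from collapsing onto each other, matching the exploration-versus-coherence trade-off this contribution describes.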
Empirical demonstration of structured policy formation in CPO

The authors empirically verify that their KL constraint mechanism leads to a stable and well-structured policy ensemble where follower policies naturally distribute around the leader without misalignment, avoiding the policy divergence issues observed in prior methods like SAPG.

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical analysis of policy diversity impact on ensemble policy gradient methods

The authors theoretically demonstrate that excessive divergence between leader and follower policies reduces effective sample size and increases gradient estimation bias in ensemble policy gradient methods, thereby harming both training stability and sample efficiency in large-scale reinforcement learning.

Contribution

Coupled Policy Optimization (CPO) method

The authors introduce CPO, a novel method that regulates policy diversity in ensemble reinforcement learning by constraining KL divergence between follower and leader policies during updates, combined with an adversarial reward mechanism to prevent policy overconcentration while maintaining exploration diversity.

Contribution

Empirical demonstration of structured policy formation in CPO

The authors empirically verify that their KL constraint mechanism leads to a stable and well-structured policy ensemble where follower policies naturally distribute around the leader without misalignment, avoiding the policy divergence issues observed in prior methods like SAPG.