Discovering Hierarchical Software Engineering Agents via Bandit Optimization

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Keywords: Multi-armed bandit, Model selection, Software engineering

Abstract:

Large language models (LLMs) are increasingly applied to software engineering (SWE), but they struggle on real-world tasks that are long-horizon and often out of distribution. Current systems typically adopt monolithic designs where a single model attempts to interpret ambiguous issues, navigate large codebases, and implement fixes in one extended reasoning chain. This design makes it difficult to generalize beyond training data. Inspired by how human engineers decompose problems into sub-tasks, we argue that SWE agents should be structured as orchestrators coordinating specialized sub-agents, each responsible for a specific sub-task such as bug reproduction, fault localization, code modification, or validation. The central challenge is how to design these hierarchies effectively. Manual decompositions follow human workflows but often mismatch LLM capabilities, while automated search methods such as evolutionary strategies require evaluating a very large number of candidates, making them prohibitively expensive for SWE. We show that formulating hierarchy discovery as a multi-armed bandit problem enables efficient exploration of sub-agent designs under limited budgets. On SWE-bench-Verified, this approach outperforms single-agent systems and manually designed multi-agent systems. On SWE-bench-Live, which features recent and out-of-distribution issues, our system ranks 2nd on the leaderboard with a 36B model, surpassing larger systems such as GPT-4 and Claude. This provides the first evidence that hierarchical multi-agent systems improve generalization on challenging long-horizon SWE tasks.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hierarchical multi-agent system for software engineering tasks, formulating hierarchy discovery as a multi-armed bandit problem and introducing the BOAD method with hindsight-based credit assignment. It resides in the 'Hierarchical and Layered Agent Architectures' leaf, which contains five papers total including this one. This leaf sits within the broader 'Multi-Agent System Architectures and Frameworks' branch, indicating a moderately populated research direction focused on structural design rather than task-specific applications. The sibling papers explore related themes of layered reasoning and adaptive hierarchies, suggesting this is an active but not overcrowded area.

The taxonomy reveals neighboring work in 'General-Purpose Multi-Agent Frameworks' (six papers) and 'Collaborative Multi-Agent Workflows' (three papers), with task-specific applications distributed across debugging, code generation, and full SDLC automation. The paper's focus on discovering hierarchies through bandit optimization distinguishes it from manually designed frameworks like MetaGPT or role-based collaborations. The scope note for its leaf emphasizes parent-child relationships and tree-based models, while excluding flat collaborations, positioning this work at the intersection of structural design and adaptive learning rather than fixed workflow orchestration.

Among twenty-two candidates examined, the hindsight-based credit assignment contribution shows one refutable candidate from ten examined, suggesting some prior work on credit assignment mechanisms exists within the limited search scope. The bandit formulation and BOAD method contributions show zero refutable candidates from ten and two examined respectively, indicating these appear more novel within the top-K semantic matches analyzed. The relatively small candidate pool (twenty-two total) means the analysis captures closely related work but may not reflect the full breadth of hierarchical multi-agent research or bandit-based optimization in adjacent domains.

Based on the limited search scope of twenty-two semantically similar candidates, the work appears to occupy a distinct position combining bandit optimization with hierarchical agent design for software engineering. The taxonomy structure suggests this intersection is less explored than general frameworks or task-specific applications, though the single refutable pair for credit assignment indicates some conceptual overlap exists. The analysis does not cover exhaustive prior work in reinforcement learning, hierarchical planning, or broader software engineering automation literature.

Taxonomy

[Figure: core-task taxonomy of papers. Legend: claimed contributions; contribution candidate papers compared; refutable paper.]

Research Landscape Overview

Core task: Discovering hierarchical multi-agent systems for software engineering tasks. The field organizes around several complementary branches that together address how multiple agents can collaborate on complex software development activities. Multi-Agent System Architectures and Frameworks establish foundational designs that define how agents communicate, coordinate, and divide responsibilities, ranging from hierarchical and layered structures like those in Hierarchical Bandit Agents[0] and Agentmesh[1] to general-purpose platforms such as MetaGPT[14] and Agentscope[26]. Task-Specific Multi-Agent Applications focus on concrete problem domains like code generation, debugging, and testing, while Collaborative Multi-Agent Workflows and Coordination examine interaction patterns, role assignment, and dynamic task decomposition. Domain-Specific Multi-Agent Implementations tailor these ideas to specialized contexts (e.g., agile development in LLM Multiagent Agile[3] or autonomous agile workflows in Autonomous Agile Agents[8]), and Cross-Cutting Concerns and Enabling Technologies address shared challenges such as memory management, knowledge representation, and evaluation frameworks.

A particularly active line of work explores hierarchical and layered architectures that decompose software tasks into manageable subtasks, enabling agents to operate at different levels of abstraction. Hierarchical Bandit Agents[0] sits squarely in this space, emphasizing adaptive decision-making within a structured hierarchy, and shares conceptual ground with PC Agent[2] and BOAD[40], which also leverage layered reasoning to handle complex workflows. In contrast, works like SWE Agent[5] and SWE Debate[6] prioritize specialized tooling and debate-driven refinement for software engineering benchmarks, illustrating a trade-off between general hierarchical frameworks and task-optimized designs.

Open questions remain around how to balance flexibility and specialization, how to dynamically adjust hierarchies as tasks evolve, and how to integrate human feedback into these layered structures. Hierarchical Bandit Agents[0] contributes to this landscape by combining bandit-based exploration with hierarchical decomposition, positioning itself among frameworks that seek both structural clarity and adaptive learning.

Claimed Contributions

Formulation of hierarchical multi-agent system design as a multi-armed bandit problem

10 retrieved papers

The authors formulate the discovery of hierarchical multi-agent systems as a multi-armed bandit (MAB) problem, where each arm corresponds to a sub-agent design. This formulation enables efficient exploration of sub-agent designs under limited evaluation budgets, addressing the challenge that evaluating candidates in software engineering is prohibitively expensive.
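To make the bandit framing concrete, the following is a minimal sketch of a standard UCB1 loop run under a fixed evaluation budget. It is not taken from the paper: the function names, the Bernoulli reward simulation, and the `success_probs` values are illustrative assumptions, with each arm standing in for one candidate sub-agent design.

```python
import math
import random


def ucb1_select(counts, means, c=math.sqrt(2)):
    """Pick the arm with the highest UCB1 score; untried arms go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        m + c * math.sqrt(math.log(total) / n)
        for m, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(success_probs, budget, seed=0):
    """Spend `budget` evaluations across arms (hypothetical sub-agent designs).

    `success_probs` simulates each design's unknown success rate; a real
    system would replace the simulated reward with an actual task run.
    """
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    means = [0.0] * len(success_probs)
    for _ in range(budget):
        arm = ucb1_select(counts, means)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.2, 0.5, 0.8], budget=60)
    print("pulls per arm:", counts)
    print("estimated success rates:", [round(m, 2) for m in means])
```

The budget caps the total number of evaluations, which is the constraint the formulation is meant to address: each pull corresponds to one expensive task run.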

Bandit Optimization for Agent Design (BOAD) method

2 retrieved papers

The authors introduce BOAD, a method that maintains an archive of candidate sub-agents and uses an Upper Confidence Bound (UCB) strategy to balance exploration and exploitation. The method dynamically expands the archive using a Chinese Restaurant Process and employs hindsight-based credit assignment to evaluate individual sub-agent contributions, avoiding the free-rider problem.
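One way archive expansion with a Chinese Restaurant Process could work is sketched below. This is an assumption-laden illustration rather than the paper's procedure: the report does not say how BOAD parameterizes the process, so the concentration parameter `theta`, the `propose_new` callback, and the rule "open a new arm with probability theta / (theta + n)" are the textbook CRP form applied to a hypothetical archive.

```python
import random


def crp_new_arm_probability(num_evaluations, theta=1.0):
    """Textbook CRP probability of opening a new table after n customers."""
    return theta / (theta + num_evaluations)


def maybe_expand_archive(archive, num_evaluations, propose_new, rng, theta=1.0):
    """Append a new candidate sub-agent with CRP probability.

    `propose_new` is a hypothetical callback (for example, an LLM call that
    drafts a new sub-agent); it receives the current archive size.
    Returns True if the archive grew.
    """
    p_new = crp_new_arm_probability(num_evaluations, theta)
    if not archive or rng.random() < p_new:
        archive.append(propose_new(len(archive)))
        return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    archive = []
    for t in range(100):
        maybe_expand_archive(archive, t, lambda i: f"sub_agent_{i}", rng)
    print("archive size after 100 rounds:", len(archive))
```

Under this rule the chance of adding a new candidate shrinks as evaluations accumulate, so the archive grows quickly at first and slowly later, leaving most of the budget for refining estimates of existing candidates.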

Hindsight-based credit assignment strategy for sub-agent evaluation

Can Refute

10 retrieved papers

The authors propose a hindsight-based credit assignment strategy that uses an LLM judge to assess whether individual sub-agents contributed meaningfully within a trajectory. This approach rewards sub-agents for helpful intermediate steps rather than relying solely on final success rates, thereby addressing the credit assignment problem and avoiding free-riding effects.
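The difference between outcome-only credit and hindsight credit can be illustrated with a small sketch. Everything here is hypothetical: the `Step` and `Trajectory` types, the sub-agent names, and especially `stub_judge`, which stands in for the LLM judge with a trivial keyword check so the example runs offline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    sub_agent: str
    output: str


@dataclass
class Trajectory:
    steps: list
    resolved: bool  # whether the final patch solved the task


def stub_judge(trajectory, sub_agent):
    """Hypothetical stand-in for an LLM judge: a sub-agent counts as helpful
    if any of its steps in this trajectory is marked 'useful'."""
    return any(
        s.sub_agent == sub_agent and "useful" in s.output
        for s in trajectory.steps
    )


def hindsight_credit(trajectories, judge):
    """Fraction of trajectories in which the judge found the sub-agent helpful."""
    helped, used = defaultdict(int), defaultdict(int)
    for traj in trajectories:
        for name in {s.sub_agent for s in traj.steps}:
            used[name] += 1
            helped[name] += int(judge(traj, name))
    return {name: helped[name] / used[name] for name in used}


def outcome_credit(trajectories):
    """Baseline: credit every participating sub-agent with the final outcome."""
    wins, used = defaultdict(int), defaultdict(int)
    for traj in trajectories:
        for name in {s.sub_agent for s in traj.steps}:
            used[name] += 1
            wins[name] += int(traj.resolved)
    return {name: wins[name] / used[name] for name in used}


if __name__ == "__main__":
    trajs = [
        Trajectory([Step("localizer", "useful: found file"),
                    Step("validator", "no-op")], resolved=True),
        Trajectory([Step("localizer", "useful: narrowed to function")],
                   resolved=False),
    ]
    print("outcome credit:", outcome_credit(trajs))
    print("hindsight credit:", hindsight_credit(trajs, stub_judge))
```

In this toy data the idle `validator` receives full credit under the outcome baseline because it appeared in a resolved trajectory, which is the free-rider effect; the hindsight variant gives it none and still rewards the `localizer` for a helpful step in a failed run.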

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[2] PC-Agent: A hierarchical multi-agent collaboration framework for complex task automation on PC

Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang (2025)

[18] Hierarchical multi-agent reinforcement learning

Rajbala Makar (2006)

[40] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong (2025)

[41] TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation

Ming-Tung Shen, Yuh-Jzer Joung (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Formulation of hierarchical multi-agent system design as a multi-armed bandit problem

[63] MArBLE: Hierarchical multi-armed bandits for human-in-the-loop set expansion. Verdict: Cannot Refute

[64] Bilevel Multi-Armed Bandit-Based Hierarchical Reinforcement Learning for Interaction-Aware Self-Driving at Unsignalized Intersections. Verdict: Cannot Refute

[65] Developing Heuristics for Resource Allocation and Utilization in Systems Design: A Hierarchical Reinforcement Learning Approach. Verdict: Cannot Refute

[66] IEEE 802.11 bn Multi-AP Coordinated Spatial Reuse with Hierarchical Multi-Armed Bandits. Verdict: Cannot Refute

[67] Matching in multi-arm bandit with collision. Verdict: Cannot Refute

[68] Distributed Deep Multi-Agent Reinforcement Learning for Cooperative Edge Caching in Internet-of-Vehicles. Verdict: Cannot Refute

[69] Hierarchical Bayesian Bandits. Verdict: Cannot Refute

[70] Multilevel constrained bandits: A hierarchical upper confidence bound approach with safety guarantees. Verdict: Cannot Refute

[71] Hierarchical Multi-Armed Bandits for the Concurrent Intelligent Tutoring of Concepts and Problems of Varying Difficulty Levels. Verdict: Cannot Refute

[72] Top-k Multi-Armed Bandit Learning for Content Dissemination in Swarms of Micro-UAVs. Verdict: Cannot Refute

Contribution: Bandit Optimization for Agent Design (BOAD) method

[51] Quality Diversity Optimization: A Modular Framework and Continuous Density Estimation. Verdict: Cannot Refute

[52] Swarm and Evolutionary Computation. Verdict: Cannot Refute

Contribution: Hindsight-based credit assignment strategy for sub-agent evaluation

[53] Reflective multi-agent collaboration based on large language models. Verdict: Can Refute

[54] Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment. Verdict: Cannot Refute

[55] STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning. Verdict: Cannot Refute

[56] Emergent Agentic Transformer from Chain of Hindsight Experience. Verdict: Cannot Refute

[57] MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment. Verdict: Cannot Refute

[58] Hindsight-aware deep reinforcement learning algorithm for multi-agent systems. Verdict: Cannot Refute

[59] On actions that matter: Credit assignment and interpretability in reinforcement learning. Verdict: Cannot Refute

[60] Improving value factorization for multi-agent deep reinforcement learning via individual contribution. Verdict: Cannot Refute

[61] Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment. Verdict: Cannot Refute

[62] Multi-Agent Credit Assignment and Bankruptcy Game for Improving Resource Allocation in Smart Cities. Verdict: Cannot Refute
