Discovering Hierarchical Software Engineering Agents via Bandit Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multi-armed bandit, Model selection, Software engineering
Abstract:

Large language models (LLMs) are increasingly applied to software engineering (SWE), but they struggle on real-world tasks that are long-horizon and often out of distribution. Current systems typically adopt monolithic designs where a single model attempts to interpret ambiguous issues, navigate large codebases, and implement fixes in one extended reasoning chain. This design makes it difficult to generalize beyond training data. Inspired by how human engineers decompose problems into sub-tasks, we argue that SWE agents should be structured as orchestrators coordinating specialized sub-agents, each responsible for a specific sub-task such as bug reproduction, fault localization, code modification, or validation. The central challenge is how to design these hierarchies effectively. Manual decompositions follow human workflows but often mismatch LLM capabilities, while automated search methods such as evolutionary strategies require evaluating a very large number of candidates, making them prohibitively expensive for SWE. We show that formulating hierarchy discovery as a multi-armed bandit problem enables efficient exploration of sub-agent designs under limited budgets. On SWE-bench-Verified, this approach outperforms single-agent systems and manually designed multi-agent systems. On SWE-bench-Live, which features recent and out-of-distribution issues, our system ranks 2nd on the leaderboard with a 36B model, surpassing larger systems such as GPT-4 and Claude. This provides the first evidence that hierarchical multi-agent systems improve generalization on challenging long-horizon SWE tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hierarchical multi-agent system for software engineering tasks, formulating hierarchy discovery as a multi-armed bandit problem and introducing the BOAD method with hindsight-based credit assignment. It resides in the 'Hierarchical and Layered Agent Architectures' leaf, which contains five papers total including this one. This leaf sits within the broader 'Multi-Agent System Architectures and Frameworks' branch, indicating a moderately populated research direction focused on structural design rather than task-specific applications. The sibling papers explore related themes of layered reasoning and adaptive hierarchies, suggesting this is an active but not overcrowded area.

The taxonomy reveals neighboring work in 'General-Purpose Multi-Agent Frameworks' (six papers) and 'Collaborative Multi-Agent Workflows' (three papers), with task-specific applications distributed across debugging, code generation, and full SDLC automation. The paper's focus on discovering hierarchies through bandit optimization distinguishes it from manually designed frameworks like MetaGPT or role-based collaborations. The scope note for its leaf emphasizes parent-child relationships and tree-based models, while excluding flat collaborations, positioning this work at the intersection of structural design and adaptive learning rather than fixed workflow orchestration.

Among twenty-two candidates examined, the hindsight-based credit assignment contribution shows one refutable candidate from ten examined, suggesting some prior work on credit assignment mechanisms exists within the limited search scope. The bandit formulation and BOAD method contributions show zero refutable candidates from ten and two examined respectively, indicating these appear more novel within the top-K semantic matches analyzed. The relatively small candidate pool (twenty-two total) means the analysis captures closely related work but may not reflect the full breadth of hierarchical multi-agent research or bandit-based optimization in adjacent domains.

Based on the limited search scope of twenty-two semantically similar candidates, the work appears to occupy a distinct position combining bandit optimization with hierarchical agent design for software engineering. The taxonomy structure suggests this intersection is less explored than general frameworks or task-specific applications, though the single refutable pair for credit assignment indicates some conceptual overlap exists. The analysis does not cover exhaustive prior work in reinforcement learning, hierarchical planning, or broader software engineering automation literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: Discovering hierarchical multi-agent systems for software engineering tasks.

The field organizes around several complementary branches that together address how multiple agents can collaborate on complex software development activities. Multi-Agent System Architectures and Frameworks establish foundational designs, ranging from hierarchical and layered structures like those in Hierarchical Bandit Agents[0] and Agentmesh[1] to general-purpose platforms such as MetaGPT[14] and Agentscope[26], that define how agents communicate, coordinate, and divide responsibilities. Task-Specific Multi-Agent Applications focus on concrete problem domains like code generation, debugging, and testing, while Collaborative Multi-Agent Workflows and Coordination examine interaction patterns, role assignment, and dynamic task decomposition. Domain-Specific Multi-Agent Implementations tailor these ideas to specialized contexts (e.g., agile development in LLM Multiagent Agile[3] or autonomous agile workflows in Autonomous Agile Agents[8]), and Cross-Cutting Concerns and Enabling Technologies address shared challenges such as memory management, knowledge representation, and evaluation frameworks.

A particularly active line of work explores hierarchical and layered architectures that decompose software tasks into manageable subtasks, enabling agents to operate at different levels of abstraction. Hierarchical Bandit Agents[0] sits squarely in this space, emphasizing adaptive decision-making within a structured hierarchy, and shares conceptual ground with PC Agent[2] and BOAD[40], which also leverage layered reasoning to handle complex workflows. In contrast, works like SWE Agent[5] and SWE Debate[6] prioritize specialized tooling and debate-driven refinement for software engineering benchmarks, illustrating a trade-off between general hierarchical frameworks and task-optimized designs.

Open questions remain around how to balance flexibility and specialization, how to dynamically adjust hierarchies as tasks evolve, and how to integrate human feedback into these layered structures. Hierarchical Bandit Agents[0] contributes to this landscape by combining bandit-based exploration with hierarchical decomposition, positioning itself among frameworks that seek both structural clarity and adaptive learning.

Claimed Contributions

Formulation of hierarchical multi-agent system design as a multi-armed bandit problem

The authors formulate the discovery of hierarchical multi-agent systems as a multi-armed bandit (MAB) problem, where each arm corresponds to a sub-agent design. This formulation enables efficient exploration of sub-agent designs under limited evaluation budgets, addressing the challenge that evaluating candidates in software engineering is prohibitively expensive.
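
For readers less familiar with the setting, the formulation described above can be written in generic bandit notation. This is a textbook-style sketch, not the paper's own notation: the symbols $\mathcal{A}_t$, $a_t$, $r_t$, $N_a(t)$, $\hat{\mu}_a(t)$, and $T$ are all assumptions introduced here for illustration, as is the choice to bound rewards in $[0,1]$.

```latex
% Illustrative notation only; none of these symbols are taken from the paper.
\begin{align*}
a_t &\in \mathcal{A}_t && \text{arm (sub-agent design) chosen at round } t,\\
r_t &\in [0,1] && \text{reward observed after evaluating } a_t,\\
N_a(t) &= \sum_{s=1}^{t} \mathbb{1}[a_s = a] && \text{evaluations spent on arm } a \text{ so far},\\
\hat{\mu}_a(t) &= \frac{1}{N_a(t)} \sum_{s=1}^{t} \mathbb{1}[a_s = a]\, r_s && \text{empirical mean reward of arm } a,\\
\max_{a_1,\dots,a_T} &\ \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big] && \text{objective under an evaluation budget of } T \text{ rounds}.
\end{align*}
```

Under this reading, the budget $T$ is what makes the problem hard: each round is a costly agent rollout, so the selection rule has to concentrate evaluations on promising designs quickly.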

10 retrieved papers

Bandit Optimization for Agent Design (BOAD) method

The authors introduce BOAD, a method that maintains an archive of candidate sub-agents and uses an Upper Confidence Bound (UCB) strategy to balance exploration and exploitation. The method dynamically expands the archive using a Chinese Restaurant Process and employs hindsight-based credit assignment to evaluate individual sub-agent contributions, avoiding the free-rider problem.
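
To make the two mechanisms named above concrete, here is a minimal Python sketch of an archive with UCB-based selection and a Chinese Restaurant Process (CRP) expansion probability. It is not the authors' implementation. The standard UCB1 bonus, the concentration parameter `theta`, the mapping of CRP "customers" to evaluations made so far, the `Archive` class, and the sub-agent names are all assumptions made for illustration.

```python
import math


def ucb_score(mean_reward, pulls, total_pulls, c=1.0):
    """UCB1-style score: empirical mean plus an exploration bonus.

    Untried arms get an infinite score so each is evaluated at least once.
    """
    if pulls == 0:
        return math.inf
    return mean_reward + c * math.sqrt(2.0 * math.log(total_pulls) / pulls)


def crp_new_arm_probability(num_evaluations, theta=1.0):
    """Standard CRP probability of opening a new table.

    Assumption: 'customers' are evaluations made so far, and a new table
    corresponds to proposing a new sub-agent design for the archive.
    """
    return theta / (theta + num_evaluations)


class Archive:
    """Hypothetical archive of candidate sub-agents with running statistics."""

    def __init__(self):
        self.reward_sum = {}
        self.pulls = {}

    def add(self, name):
        self.reward_sum.setdefault(name, 0.0)
        self.pulls.setdefault(name, 0)

    def select(self, c=1.0):
        """Return the sub-agent with the highest UCB score."""
        total = sum(self.pulls.values())

        def score(name):
            n = self.pulls[name]
            mean = self.reward_sum[name] / n if n else 0.0
            return ucb_score(mean, n, total, c)

        return max(self.pulls, key=score)

    def update(self, name, reward):
        self.reward_sum[name] += reward
        self.pulls[name] += 1


# Usage with two hypothetical sub-agents and hand-picked rewards.
archive = Archive()
for name in ("localizer_v1", "reproducer_v1"):
    archive.add(name)

first = archive.select()    # both untried; ties go to the first one added
archive.update(first, 1.0)
second = archive.select()   # the remaining untried arm
archive.update(second, 0.0)
third = archive.select()    # equal bonuses, so the higher mean wins
```

Because the CRP probability shrinks as evaluations accumulate, a scheme like this would propose many new designs early and then shift toward refining estimates for designs already in the archive.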

2 retrieved papers

Hindsight-based credit assignment strategy for sub-agent evaluation

The authors propose a hindsight-based credit assignment strategy that uses an LLM judge to assess whether individual sub-agents contributed meaningfully within a trajectory. This approach rewards sub-agents for helpful intermediate steps rather than relying solely on final success rates, thereby addressing the credit assignment problem and avoiding free-riding effects.
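
The idea described above can be sketched as a per-sub-agent reward computed from judge verdicts on each step of a finished trajectory. The following Python example is an assumption-laden illustration, not the paper's procedure: the `hindsight_credit` function, the "fraction of calls judged helpful" reward, and the `keyword_judge` stand-in (which replaces a real LLM judge with a trivial keyword check) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# One trajectory step: (sub_agent_name, output_text). Hypothetical format.
Step = Tuple[str, str]
Judge = Callable[[str, str, List[Step]], bool]


def hindsight_credit(trajectory: List[Step], judge: Judge) -> Dict[str, float]:
    """Reward each sub-agent by the fraction of its calls judged helpful.

    The judge sees the whole finished trajectory (hence 'hindsight'), so a
    sub-agent is credited for its own steps, not for the final outcome alone.
    """
    helpful = defaultdict(int)
    calls = defaultdict(int)
    for name, output in trajectory:
        calls[name] += 1
        if judge(name, output, trajectory):
            helpful[name] += 1
    return {name: helpful[name] / calls[name] for name in calls}


def keyword_judge(name: str, output: str, trajectory: List[Step]) -> bool:
    """Stand-in for an LLM judge: a step counts as helpful if it reports a finding.

    A real judge would prompt a model with the trajectory; this stub only
    keeps the example self-contained and deterministic.
    """
    return "found" in output.lower()


# Usage with a hand-written, hypothetical trajectory.
trajectory = [
    ("localizer", "Found faulty function parse_config"),
    ("reproducer", "could not reproduce"),
    ("localizer", "no further matches"),
]
credits = hindsight_credit(trajectory, keyword_judge)
```

If the reward were instead the final success of the trajectory, both sub-agents here would receive the same value regardless of what they contributed, which is the free-riding effect the contribution aims to avoid.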

10 retrieved papers (1 flagged as Can Refute)
