Scalable Supervising Software Agents with Patch Reasoner

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM, Software Engineering, Agent
Abstract:

While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision limits the gains available from data scaling. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and vulnerable to test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that provides scalable rewards for training and testing SWE agents via reasoning. We argue that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain a sufficient reference signal and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lightweight scaffold, Mini-SE, with pure reinforcement learning in which all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, a 10.0% improvement over the original Qwen3-32B, which further improves to 33.8% when R4P is used for test-time scaling. The stable scaling curves in both RL test rewards and test-time accuracy reflect R4P's practical utility for scalable supervision of software agents.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces R4P, a reasoning-based patch verifier that provides scalable supervision for training software engineering agents without executing tests. It resides in the Reasoning-Based Patch Verification leaf, which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 42 papers across the field. The sibling paper in this leaf focuses on formal verification using Dafny, suggesting that reasoning-based approaches without test execution remain relatively underexplored compared to the more populated Test-Based Patch Validation category, which contains three papers emphasizing test suite execution.

The taxonomy reveals that most patch validation work clusters around test-based methods or hybrid approaches combining tests with heuristics. The Hybrid Verification Approaches leaf contains two papers, while neighboring branches like Bug Fixing and Automated Repair contain multiple papers relying primarily on test outcomes for validation. The paper's emphasis on reasoning mirrors human code review processes, positioning it closer to formal verification traditions than to the execution-heavy workflows dominant in multi-agent collaborative systems and planning-based architectures. This creates a clear boundary: R4P avoids the fragility and scalability issues of test sandboxes that characterize most validation approaches in adjacent categories.

Among 23 candidates examined across three contributions, none were found to clearly refute the paper's claims. The core R4P verifier contribution examined 10 candidates with zero refutable matches, suggesting limited direct prior work on reasoning-based patch verification at this scale. The group-wise training objective examined only 3 candidates, also with no refutations, indicating this training strategy may be relatively novel within the limited search scope. The Mini-SE agent scaffold examined 10 candidates with no refutations, though this may reflect the specific combination of pure RL training with reasoning-based rewards rather than the broader concept of lightweight agent architectures.

Based on the limited search scope of 23 semantically similar papers, the work appears to occupy a relatively unexplored position combining reasoning-based verification with reinforcement learning for agent training. The sparse population of its taxonomy leaf and absence of refuting candidates suggest novelty, though the analysis cannot rule out relevant work outside the top-K semantic matches examined. The 72.2% accuracy claim on SWE-bench-verified and comparison to OpenAI o3 would benefit from broader literature context to assess whether similar verification performance has been reported elsewhere.

Taxonomy

Core-task Taxonomy Papers: 42
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Scalable supervision for software engineering agents through patch verification. The field has evolved around the challenge of automatically generating, validating, and deploying code changes at scale. The taxonomy reveals several major branches: Patch Verification and Validation Methods explore how to determine whether generated patches are correct and safe, ranging from test-based approaches to reasoning-based verification systems like Dafny Verification[23]. Agent Architectures and Workflows examine the design patterns and execution strategies that enable autonomous software engineering, while Training and Scaling Methodologies address how to build datasets and learning signals that improve agent performance over time. Parallel branches focus on Vulnerability Detection and Exploitation, Bug Fixing and Automated Repair, and domain-specific applications, reflecting the breadth of tasks that software agents now tackle. Evaluation and Benchmarking provides the measurement infrastructure, and Infrastructure and Tooling supports the practical deployment of these systems.

Within this landscape, a particularly active tension exists between test-driven validation, where patches are accepted if they pass existing test suites, and more rigorous verification approaches that reason about correctness guarantees. Patch Reasoner[0] sits squarely in the reasoning-based verification cluster, emphasizing formal or semantic analysis to validate patches beyond simple test execution. This contrasts with many works in the Bug Fixing branch that rely primarily on test outcomes, and also differs from vulnerability-focused efforts like CVE Verifiable Exploits[5] that prioritize exploit generation over patch correctness proofs. Nearby, Dafny Verification[23] shares the formal verification emphasis, while works like Faultline[3] and PatchPilot[4] blend localization and repair with lighter-weight validation.

The central open question remains how to scale rigorous verification without sacrificing the speed and flexibility that make agent-based repair practical for real-world codebases.

Claimed Contributions

R4P: A reasoning-based patch verifier model for scalable supervision

The authors introduce R4P, a model that verifies software patches through reasoning rather than test execution. It uses a group-wise objective to compare multiple patches against each other, providing dense rewards for stable reinforcement learning training without requiring golden tests, developer patches, agent trajectories, or runtime sandboxes.

10 retrieved papers

Group-wise training objective for patch verification

The authors propose a group-wise training approach where R4P assesses each patch by comparing it against others in a candidate set. This formulation provides mutual contextual information to compensate for the absence of tests, reduces reward hacking risk, and offers denser rewards than binary classification for more stable convergence.

3 retrieved papers

Mini-SE: A lightweight execution-free agentic scaffold trained with pure RL

The authors develop Mini-SE, a lightweight software engineering agent with issue-resolving code search and edit capabilities. It is trained using pure reinforcement learning supervised entirely by R4P without test execution during rollout, demonstrating the practical utility of reasoning-based verification for scalable agent training.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

R4P: A reasoning-based patch verifier model for scalable supervision

The authors introduce R4P, a model that verifies software patches through reasoning rather than test execution. It uses a group-wise objective to compare multiple patches against each other, providing dense rewards for stable reinforcement learning training without requiring golden tests, developer patches, agent trajectories, or runtime sandboxes.

Contribution

Group-wise training objective for patch verification

The authors propose a group-wise training approach where R4P assesses each patch by comparing it against others in a candidate set. This formulation provides mutual contextual information to compensate for the absence of tests, reduces reward hacking risk, and offers denser rewards than binary classification for more stable convergence.
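The group-wise idea described above can be illustrated with a minimal sketch: each patch in a candidate set is judged against every other patch, and its reward is its win rate within the group, yielding a dense score in [0, 1] instead of a binary pass/fail label. All names here (`pairwise_prefers`, the toy score table) are hypothetical stand-ins for illustration, not the paper's actual verifier.

```python
def pairwise_prefers(verifier, patch_a: str, patch_b: str) -> bool:
    """Stand-in for a reasoning verifier's judgment: does it prefer
    patch_a over patch_b as a fix for the issue? Here the 'verifier'
    is any callable returning a scalar quality score."""
    return verifier(patch_a) > verifier(patch_b)

def group_wise_rewards(verifier, patches: list[str]) -> list[float]:
    """Dense reward for each patch: its win rate against the rest of
    the candidate group."""
    n = len(patches)
    rewards = []
    for i in range(n):
        wins = sum(
            pairwise_prefers(verifier, patches[i], patches[j])
            for j in range(n)
            if j != i
        )
        rewards.append(wins / (n - 1))
    return rewards

# Toy verifier: look up a made-up quality score per patch.
toy_scores = {"patch_a": 0.9, "patch_b": 0.4, "patch_c": 0.1}
rewards = group_wise_rewards(toy_scores.get, list(toy_scores))
# rewards → [1.0, 0.5, 0.0]: a dense ranking signal, not pass/fail
```

Compared with binary classification, this group-relative score preserves ordering information among candidates, which is what makes the RL reward denser and convergence more stable in the authors' account.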

Contribution

Mini-SE: A lightweight execution-free agentic scaffold trained with pure RL

The authors develop Mini-SE, a lightweight software engineering agent with issue-resolving code search and edit capabilities. It is trained using pure reinforcement learning supervised entirely by R4P without test execution during rollout, demonstrating the practical utility of reasoning-based verification for scalable agent training.
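A pure-RL step supervised only by a verifier can be sketched as follows: roll out a group of candidate patches per issue, score them with the verifier, and convert the scores into group-relative advantages (in the spirit of group-based policy-gradient methods such as GRPO). This is a minimal sketch under those assumptions; `policy_sample`, `verifier`, and `training_step` are hypothetical stand-ins, not the paper's implementation.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Center and scale verifier rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mu) / sigma for r in rewards]

def training_step(policy_sample, verifier, issue: str, group_size: int = 4):
    """One execution-free step: sample a group of patches for an issue,
    reward each with the verifier (no test sandbox), and return the
    patches with their group-relative advantages for a policy update."""
    patches = [policy_sample(issue) for _ in range(group_size)]
    rewards = [verifier(issue, p) for p in patches]
    return patches, group_advantages(rewards)

# Alternating toy rewards produce symmetric positive/negative advantages.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

The key property for scalability is that no step in this loop builds a sandbox or runs tests: the verifier call is the only supervision, so rollout throughput is bounded by inference rather than environment setup.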
