Scalable Supervising Software Agents with Patch Reasoner

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM, Software Engineering, Agent
Abstract:

While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision limits the gains available from data scaling. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and vulnerable to test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that provides scalable rewards for training and testing SWE agents via reasoning. We argue that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain a sufficient reference signal and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lightweight scaffold, Mini-SE, with pure reinforcement learning in which all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, a 10.0% improvement over the original Qwen3-32B, which further improves to 33.8% when R4P is used for test-time scaling. The stable scaling curves in both RL test rewards and test-time accuracy reflect R4P's practical utility for scalable supervision of software agents.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces R4P, a reasoning-based patch verifier that provides scalable supervision for training software engineering agents without executing tests. It resides in the Reasoning-Based Patch Verification leaf, which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 42 papers across the field. The sibling paper in this leaf focuses on formal verification using Dafny, suggesting that reasoning-based approaches without test execution remain relatively underexplored compared to the more populated Test-Based Patch Validation category, which contains three papers emphasizing test suite execution.

The taxonomy reveals that most patch validation work clusters around test-based methods or hybrid approaches combining tests with heuristics. The Hybrid Verification Approaches leaf contains two papers, while neighboring branches like Bug Fixing and Automated Repair contain multiple papers relying primarily on test outcomes for validation. The paper's emphasis on reasoning mirrors human code review processes, positioning it closer to formal verification traditions than to the execution-heavy workflows dominant in multi-agent collaborative systems and planning-based architectures. This creates a clear boundary: R4P avoids the fragility and scalability issues of test sandboxes that characterize most validation approaches in adjacent categories.

Among 23 candidates examined across three contributions, none were found to clearly refute the paper's claims. The core R4P verifier contribution examined 10 candidates with zero refutable matches, suggesting limited direct prior work on reasoning-based patch verification at this scale. The group-wise training objective examined only 3 candidates, also with no refutations, indicating this training strategy may be relatively novel within the limited search scope. The Mini-SE agent scaffold examined 10 candidates with no refutations, though this may reflect the specific combination of pure RL training with reasoning-based rewards rather than the broader concept of lightweight agent architectures.

Based on the limited search scope of 23 semantically similar papers, the work appears to occupy a relatively unexplored position combining reasoning-based verification with reinforcement learning for agent training. The sparse population of its taxonomy leaf and absence of refuting candidates suggest novelty, though the analysis cannot rule out relevant work outside the top-K semantic matches examined. The 72.2% accuracy claim on SWE-bench-verified and comparison to OpenAI o3 would benefit from broader literature context to assess whether similar verification performance has been reported elsewhere.

Taxonomy

Core-task Taxonomy Papers: 42
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Scalable supervision for software engineering agents through patch verification. The field has evolved around the challenge of automatically generating, validating, and deploying code changes at scale. The taxonomy reveals several major branches: Patch Verification and Validation Methods explore how to determine whether generated patches are correct and safe, ranging from test-based approaches to reasoning-based verification systems like Dafny Verification[23]. Agent Architectures and Workflows examine the design patterns and execution strategies that enable autonomous software engineering, while Training and Scaling Methodologies address how to build datasets and learning signals that improve agent performance over time. Parallel branches focus on Vulnerability Detection and Exploitation, Bug Fixing and Automated Repair, and domain-specific applications, reflecting the breadth of tasks that software agents now tackle. Evaluation and Benchmarking provides the measurement infrastructure, and Infrastructure and Tooling supports the practical deployment of these systems.

Within this landscape, a particularly active tension exists between test-driven validation, where patches are accepted if they pass existing test suites, and more rigorous verification approaches that reason about correctness guarantees. Patch Reasoner[0] sits squarely in the reasoning-based verification cluster, emphasizing formal or semantic analysis to validate patches beyond simple test execution. This contrasts with many works in the Bug Fixing branch that rely primarily on test outcomes, and also differs from vulnerability-focused efforts like CVE Verifiable Exploits[5] that prioritize exploit generation over patch correctness proofs. Nearby, Dafny Verification[23] shares the formal verification emphasis, while works like Faultline[3] and PatchPilot[4] blend localization and repair with lighter-weight validation.

The central open question remains how to scale rigorous verification without sacrificing the speed and flexibility that make agent-based repair practical for real-world codebases.

Claimed Contributions

R4P: A reasoning-based patch verifier model for scalable supervision

The authors introduce R4P, a model that verifies software patches through reasoning rather than test execution. It uses a group-wise objective to compare multiple patches against each other, providing dense rewards for stable reinforcement learning training without requiring golden tests, developer patches, agent trajectories, or runtime sandboxes.

10 retrieved papers

Group-wise training objective for patch verification

The authors propose a group-wise training approach where R4P assesses each patch by comparing it against others in a candidate set. This formulation provides mutual contextual information to compensate for the absence of tests, reduces reward hacking risk, and offers denser rewards than binary classification for more stable convergence.

3 retrieved papers

Mini-SE: A lightweight execution-free agentic scaffold trained with pure RL

The authors develop Mini-SE, a lightweight software engineering agent with issue-resolving code search and edit capabilities. It is trained using pure reinforcement learning supervised entirely by R4P without test execution during rollout, demonstrating the practical utility of reasoning-based verification for scalable agent training.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

R4P: A reasoning-based patch verifier model for scalable supervision

The authors introduce R4P, a model that verifies software patches through reasoning rather than test execution. It uses a group-wise objective to compare multiple patches against each other, providing dense rewards for stable reinforcement learning training without requiring golden tests, developer patches, agent trajectories, or runtime sandboxes.

Contribution

Group-wise training objective for patch verification

The authors propose a group-wise training approach where R4P assesses each patch by comparing it against others in a candidate set. This formulation provides mutual contextual information to compensate for the absence of tests, reduces reward hacking risk, and offers denser rewards than binary classification for more stable convergence.
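The group-wise idea described above can be illustrated with a minimal sketch: each patch in a candidate set is judged against every other patch, and its reward is its win rate within the group, yielding a dense score in [0, 1] instead of a binary pass/fail label. All names here (`pairwise_prefers`, the toy score table) are hypothetical stand-ins for illustration, not the paper's actual verifier.

```python
def pairwise_prefers(verifier, patch_a: str, patch_b: str) -> bool:
    """Stand-in for a reasoning verifier's judgment: does it prefer
    patch_a over patch_b as a fix for the issue? Here the 'verifier'
    is any callable returning a scalar quality score."""
    return verifier(patch_a) > verifier(patch_b)

def group_wise_rewards(verifier, patches: list[str]) -> list[float]:
    """Dense reward for each patch: its win rate against the rest of
    the candidate group."""
    n = len(patches)
    rewards = []
    for i in range(n):
        wins = sum(
            pairwise_prefers(verifier, patches[i], patches[j])
            for j in range(n)
            if j != i
        )
        rewards.append(wins / (n - 1))
    return rewards

# Toy verifier: look up a made-up quality score per patch.
toy_scores = {"patch_a": 0.9, "patch_b": 0.4, "patch_c": 0.1}
rewards = group_wise_rewards(toy_scores.get, list(toy_scores))
# rewards → [1.0, 0.5, 0.0]: a dense ranking signal, not pass/fail
```

Compared with binary classification, this group-relative score preserves ordering information among candidates, which is what makes the RL reward denser and convergence more stable in the authors' account.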

Contribution

Mini-SE: A lightweight execution-free agentic scaffold trained with pure RL

The authors develop Mini-SE, a lightweight software engineering agent with issue-resolving code search and edit capabilities. It is trained using pure reinforcement learning supervised entirely by R4P without test execution during rollout, demonstrating the practical utility of reasoning-based verification for scalable agent training.
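A pure-RL step supervised only by a verifier can be sketched as follows: roll out a group of candidate patches per issue, score them with the verifier, and convert the scores into group-relative advantages (in the spirit of group-based policy-gradient methods such as GRPO). This is a minimal sketch under those assumptions; `policy_sample`, `verifier`, and `training_step` are hypothetical stand-ins, not the paper's implementation.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Center and scale verifier rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mu) / sigma for r in rewards]

def training_step(policy_sample, verifier, issue: str, group_size: int = 4):
    """One execution-free step: sample a group of patches for an issue,
    reward each with the verifier (no test sandbox), and return the
    patches with their group-relative advantages for a policy update."""
    patches = [policy_sample(issue) for _ in range(group_size)]
    rewards = [verifier(issue, p) for p in patches]
    return patches, group_advantages(rewards)

# Alternating toy rewards produce symmetric positive/negative advantages.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

The key property for scalability is that no step in this loop builds a sandbox or runs tests: the verifier call is the only supervision, so rollout throughput is bounded by inference rather than environment setup.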
