Abstract:

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand the code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems on which only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect the task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EditBench, a benchmark for evaluating LLM code editing capabilities using real-world user instructions and code contexts. It resides in the 'Real-World Code Editing Evaluation' leaf, which contains only three papers, including this one. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 30 topics, suggesting that authentic, production-grounded code editing benchmarks remain an emerging area despite the maturity of adjacent fields like general code generation and synthetic dataset creation.

The taxonomy reveals that EditBench's immediate neighbors include 'Synthetic Dataset Generation for Code Editing' (two papers) and broader categories like 'Prompting and Instruction Strategies' (five papers across three sub-leaves) and 'Instruction Tuning for Code Editing' (eight papers across three sub-leaves). While the field has invested heavily in training methods and prompt engineering, the scarcity of real-world evaluation frameworks indicates a gap between model development and ecologically valid assessment. EditBench bridges this gap by emphasizing authentic developer interactions rather than commit-based or artificially constructed tasks.

Among the three contributions analyzed, the analysis of the core EditBench benchmark examined ten candidates with zero refutations, suggesting novelty in its specific design. For the VS Code extension for data collection, ten candidates were examined and one refutable prior work was found, indicating some overlap with existing data collection tools. For the context-dependent evaluation mechanism, only two candidates were examined, with no refutations. The limited search scope (22 total candidates across all contributions) means this analysis captures a snapshot rather than exhaustive coverage of the literature.

Based on the top-22 semantic matches examined, EditBench appears to occupy a relatively novel position within a sparse research direction. The benchmark's emphasis on real-world instructions, contextual information (highlighted code, cursor position), and diverse use cases distinguishes it from sibling papers in the same leaf, though the limited search scope means additional related work may exist beyond the candidates analyzed here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: instructed code editing with large language models. The field has organized itself around several complementary dimensions. At the highest level, researchers have developed Code Editing Frameworks and Benchmarks to measure real-world performance, alongside Prompting and Instruction Strategies that explore how to elicit precise edits from LLMs. Instruction Tuning for Code Editing focuses on adapting models through supervised fine-tuning, while Iterative Refinement and Feedback Mechanisms investigate multi-turn correction loops. Parallel branches address Code Generation and Optimization (often overlapping with editing when models rewrite for efficiency), Domain-Specific Code Editing (targeting specialized languages or contexts), and Security and Vulnerability Repair. Additional threads examine Developer-LLM Interaction Studies to understand human workflows, Code Editing Architectures that propose novel model designs, Multimodal and Cross-Domain Editing for tasks beyond pure text, Educational Applications, and Code Quality and Correctness Analysis to verify outputs.

Within this landscape, a particularly active line of work centers on building realistic benchmarks that capture the complexity of professional code modification. EditBench[0] exemplifies this direction by evaluating LLMs on authentic editing scenarios drawn from real repositories, emphasizing localized changes rather than generation from scratch. This approach contrasts with earlier efforts such as CodeEditorBench[35] and CodeEditorBench[47] (likely duplicate retrievals of the same underlying work, which human reviewers should disambiguate), which also target practical editing but may differ in dataset construction or task granularity. Meanwhile, works such as Executable Code Actions[6] and CursorCore[45] explore how to integrate editing into interactive development environments, and studies like Developer LLM Conversations[7] examine the back-and-forth between programmers and assistants.

EditBench[0] sits squarely in the real-world evaluation cluster, sharing its emphasis on ecological validity with these neighboring benchmarks while contributing a distinct set of tasks and metrics that highlight the nuances of instruction-following during localized code modifications.

Claimed Contributions

EditBench benchmark for real-world instructed code editing

The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.

10 retrieved papers
VS Code extension for in-the-wild data collection

The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, the associated code context, and users' votes between competing model responses during real coding workflows.

10 retrieved papers
Can Refute
Context-dependent evaluation with highlighted code and cursor position

The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.

2 retrieved papers
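As an illustration, the combination of signals described in this contribution can be pictured as a problem record plus a prompt builder that exposes different levels of context to the model. The field names, context levels, and prompt layout below are hypothetical assumptions for illustration only, not EditBench's actual schema or evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class EditProblem:
    """Hypothetical record for one context-dependent editing problem."""
    instruction: str    # the user's natural-language edit request
    code_file: str      # full contents of the file being edited
    highlighted: str    # code region the user selected, if any
    cursor_offset: int  # character offset of the cursor within code_file

def build_prompt(p: EditProblem, context: str = "full") -> str:
    """Assemble a model prompt under an assumed context level:
    'instruction' (instruction only), 'file' (adds the full file),
    or 'full' (adds highlighted code and cursor position too).
    This mirrors the kind of context ablation the report describes."""
    parts = [f"Instruction: {p.instruction}"]
    if context in ("file", "full"):
        parts.append(f"File:\n{p.code_file}")
    if context == "full":
        parts.append(f"Highlighted:\n{p.highlighted}")
        parts.append(f"Cursor at offset {p.cursor_offset}")
    return "\n\n".join(parts)

problem = EditProblem(
    instruction="Rename the variable `n` to `count`.",
    code_file="def tally(xs):\n    n = 0\n    for x in xs:\n        n += 1\n    return n\n",
    highlighted="n = 0",
    cursor_offset=19,
)
```

Running an LLM on `build_prompt(problem, "instruction")` versus `build_prompt(problem, "full")` would let one measure how much the extra context (highlighted region, cursor position) changes edit success, which is the kind of up-to-11% variation the abstract reports.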

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EditBench benchmark for real-world instructed code editing

The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.

Contribution

VS Code extension for in-the-wild data collection

The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, the associated code context, and users' votes between competing model responses during real coding workflows.

Contribution

Context-dependent evaluation with highlighted code and cursor position

The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.

EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits | Novelty Validation