Abstract:

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand the code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems on which only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect the task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EditBench, a benchmark for evaluating LLM code editing capabilities using real-world user instructions and code contexts. It resides in the 'Real-World Code Editing Evaluation' leaf, which contains only three papers, including this one. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 30 topics, suggesting that authentic, production-grounded code editing benchmarks remain an emerging area despite the maturity of adjacent fields like general code generation and synthetic dataset creation.

The taxonomy reveals that EditBench's immediate neighbors include 'Synthetic Dataset Generation for Code Editing' (two papers) and broader categories like 'Prompting and Instruction Strategies' (five papers across three sub-leaves) and 'Instruction Tuning for Code Editing' (eight papers across three sub-leaves). While the field has invested heavily in training methods and prompt engineering, the scarcity of real-world evaluation frameworks indicates a gap between model development and ecologically valid assessment. EditBench bridges this gap by emphasizing authentic developer interactions rather than commit-based or artificially constructed tasks.

Among the three contributions analyzed, the analysis of the core EditBench benchmark examined ten candidates with zero refutations, suggesting novelty in its specific design. For the VS Code extension for data collection, ten candidates were examined and one refutable prior work was found, indicating some overlap with existing data collection tools. For the context-dependent evaluation mechanism, only two candidates were examined, with no refutations. The limited search scope (22 total candidates across all contributions) means this analysis captures a snapshot rather than exhaustive coverage of the literature.

Based on the top-22 semantic matches examined, EditBench appears to occupy a relatively novel position within a sparse research direction. The benchmark's emphasis on real-world instructions, contextual information (highlighted code, cursor position), and diverse use cases distinguishes it from sibling papers in the same leaf, though the limited search scope means additional related work may exist beyond the candidates analyzed here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: instructed code editing with large language models. The field has organized itself around several complementary dimensions. At the highest level, researchers have developed Code Editing Frameworks and Benchmarks to measure real-world performance, alongside Prompting and Instruction Strategies that explore how to elicit precise edits from LLMs. Instruction Tuning for Code Editing focuses on adapting models through supervised fine-tuning, while Iterative Refinement and Feedback Mechanisms investigate multi-turn correction loops. Parallel branches address Code Generation and Optimization (often overlapping with editing when models rewrite for efficiency), Domain-Specific Code Editing (targeting specialized languages or contexts), and Security and Vulnerability Repair. Additional threads examine Developer-LLM Interaction Studies to understand human workflows, Code Editing Architectures that propose novel model designs, Multimodal and Cross-Domain Editing for tasks beyond pure text, Educational Applications, and Code Quality and Correctness Analysis to verify outputs.

Within this landscape, a particularly active line of work centers on building realistic benchmarks that capture the complexity of professional code modification. EditBench[0] exemplifies this direction by evaluating LLMs on authentic editing scenarios drawn from real repositories, emphasizing localized changes rather than generation from scratch. This approach contrasts with earlier efforts such as CodeEditorBench[35] and CodeEditorBench[47] (likely duplicate retrievals of the same underlying work, which human reviewers should disambiguate), which also target practical editing but may differ in dataset construction or task granularity. Meanwhile, works such as Executable Code Actions[6] and CursorCore[45] explore how to integrate editing into interactive development environments, and studies like Developer LLM Conversations[7] examine the back-and-forth between programmers and assistants.

EditBench[0] sits squarely in the real-world evaluation cluster, sharing its emphasis on ecological validity with these neighboring benchmarks while contributing a distinct set of tasks and metrics that highlight the nuances of instruction-following during localized code modifications.

Claimed Contributions

EditBench benchmark for real-world instructed code editing

The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.

10 retrieved papers
VS Code extension for in-the-wild data collection

The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, the associated code context, and users' votes between competing model responses during real coding workflows.

10 retrieved papers
Can Refute
Context-dependent evaluation with highlighted code and cursor position

The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.

2 retrieved papers
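As an illustration, the combination of signals described in this contribution can be pictured as a problem record plus a prompt builder that exposes different levels of context to the model. The field names, context levels, and prompt layout below are hypothetical assumptions for illustration only, not EditBench's actual schema or evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class EditProblem:
    """Hypothetical record for one context-dependent editing problem."""
    instruction: str    # the user's natural-language edit request
    code_file: str      # full contents of the file being edited
    highlighted: str    # code region the user selected, if any
    cursor_offset: int  # character offset of the cursor within code_file

def build_prompt(p: EditProblem, context: str = "full") -> str:
    """Assemble a model prompt under an assumed context level:
    'instruction' (instruction only), 'file' (adds the full file),
    or 'full' (adds highlighted code and cursor position too).
    This mirrors the kind of context ablation the report describes."""
    parts = [f"Instruction: {p.instruction}"]
    if context in ("file", "full"):
        parts.append(f"File:\n{p.code_file}")
    if context == "full":
        parts.append(f"Highlighted:\n{p.highlighted}")
        parts.append(f"Cursor at offset {p.cursor_offset}")
    return "\n\n".join(parts)

problem = EditProblem(
    instruction="Rename the variable `n` to `count`.",
    code_file="def tally(xs):\n    n = 0\n    for x in xs:\n        n += 1\n    return n\n",
    highlighted="n = 0",
    cursor_offset=19,
)
```

Running an LLM on `build_prompt(problem, "instruction")` versus `build_prompt(problem, "full")` would let one measure how much the extra context (highlighted region, cursor position) changes edit success, which is the kind of up-to-11% variation the abstract reports.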

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EditBench benchmark for real-world instructed code editing

The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.

Contribution

VS Code extension for in-the-wild data collection

The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, the associated code context, and users' votes between competing model responses during real coding workflows.

Contribution

Context-dependent evaluation with highlighted code and cursor position

The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.

EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits | Novelty Validation