EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Overview
Overall Novelty Assessment
The paper introduces EditBench, a benchmark for evaluating LLM code editing capabilities using real-world user instructions and code contexts. It resides in the 'Real-World Code Editing Evaluation' leaf, which contains only three papers, including this one. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 30 topics, suggesting that authentic, production-grounded code editing benchmarks remain an emerging area despite the maturity of adjacent fields such as general code generation and synthetic dataset creation.
The taxonomy reveals that EditBench's immediate neighbors include 'Synthetic Dataset Generation for Code Editing' (two papers) and broader categories like 'Prompting and Instruction Strategies' (five papers across three sub-leaves) and 'Instruction Tuning for Code Editing' (eight papers across three sub-leaves). While the field has invested heavily in training methods and prompt engineering, the scarcity of real-world evaluation frameworks indicates a gap between model development and ecologically valid assessment. EditBench bridges this gap by emphasizing authentic developer interactions rather than commit-based or artificially constructed tasks.
Among the three contributions analyzed, the core EditBench benchmark was compared against ten candidate prior works, none of which refuted its claim, suggesting novelty in its specific design. The VS Code extension for data collection was compared against ten candidates, one of which refuted the claim, indicating some overlap with existing data collection tools. The context-dependent evaluation mechanism was compared against only two candidates, again with no refutations, though the limited search scope (22 candidates in total across all contributions) means this analysis captures a snapshot rather than exhaustive coverage of the literature.
Based on the top-22 semantic matches examined, EditBench appears to occupy a relatively novel position within a sparse research direction. The benchmark's emphasis on real-world instructions, contextual information (highlighted code, cursor position), and diverse use cases distinguishes it from sibling papers in the same leaf, though the limited search scope means additional related work may exist beyond the candidates analyzed here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.
The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, associated code context, and user votes between model responses during real coding workflows.
The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.
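A single EditBench-style problem record, as described above, might look like the following sketch. All field names here are illustrative assumptions; the paper's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EditProblem:
    """Hypothetical benchmark item: a real user instruction plus its code context."""
    instruction: str                                # user-written natural-language edit request
    file_content: str                               # full contents of the file being edited
    highlighted: Optional[Tuple[int, int]] = None   # (start, end) character offsets of the selection, if any
    cursor: Optional[int] = None                    # character offset of the cursor, if no selection
    language: str = "python"                        # programming language of the file


# Example instance mirroring the kinds of context EditBench records:
# the user has selected the parameter `n` and asks for a rename.
problem = EditProblem(
    instruction="Rename the variable `n` to `count`",
    file_content="def f(n):\n    return n + 1\n",
    highlighted=(6, 7),
    language="python",
)
```

Capturing the selection and cursor as explicit fields is what lets an evaluator test whether a model respects the user's focus, rather than treating every instruction as a whole-file rewrite.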
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[35] CodeEditorBench: Evaluating code editing capability of LLMs
[47] CodeEditorBench: Evaluating code editing capability of large language models
Contribution Analysis
Detailed comparisons for each claimed contribution
EditBench benchmark for real-world instructed code editing
The authors present EditBench, a benchmark comprising 545 problems sourced from real-world developer interactions. It evaluates LLMs on instructed code editing tasks using authentic user instructions, code contexts, highlighted code segments, and cursor positions across multiple natural and programming languages.
[47] CodeEditorBench: Evaluating code editing capability of large language models
[51] RefactorBench: Evaluating stateful reasoning in language agents through code
[52] EditEval: An instruction-based benchmark for text improvements
[53] A study of update request comments in Stack Overflow answer posts
[54] Large Language Models of Code Fail at Completing Code with Potential Bugs
[55] Opportunities and challenges in repeated revisions to pull-requests: An empirical study
[56] Automated recommendation of software refactorings based on feature requests
[57] Feature requests-based recommendation of software refactorings
[58] SVGEditBench V2: A Benchmark for Instruction-based SVG Editing
[59] VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics
VS Code extension for in-the-wild data collection
The authors built a VS Code extension that mimics existing code editing tools to gather live, in-the-wild data from nearly 500 users. This extension collects user-written instructions, associated code context, and user votes between model responses during real coding workflows.
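The kind of interaction record such an extension might log can be sketched as follows. The field names and structure are assumptions for illustration, not the authors' actual telemetry format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InteractionRecord:
    """Hypothetical log entry for one in-the-wild editing interaction."""
    user_id: str       # anonymized identifier for the participant
    instruction: str   # the edit instruction the user typed
    code_context: str  # surrounding file content at request time
    responses: list    # candidate edits returned by the models shown to the user
    vote: int          # index of the response the user preferred


record = InteractionRecord(
    user_id="u123",
    instruction="Add a docstring to this function",
    code_context="def add(a, b):\n    return a + b\n",
    responses=["...model A edit...", "...model B edit..."],
    vote=0,
)

# Serialize for storage; pairwise votes like this are what make the
# collected data usable for preference-based comparison of models.
payload = json.dumps(asdict(record))
```

Recording the user's vote alongside the full context is what distinguishes this kind of live collection from mining commits, where the developer's original intent has to be reconstructed after the fact.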
[64] In-IDE code generation from natural language: Promise and challenges
[62] CodeWatcher: IDE Telemetry Data Extraction Tool for Understanding Coding Interactions with LLMs
[63] Mind the Metrics: Patterns for Telemetry-Aware In-IDE AI Application Development using the Model Context Protocol (MCP)
[65] KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks
[66] A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions
[67] Developer Behaviors in Validating and Repairing LLM-Generated Code Using IDE and Eye Tracking
[68] CognitIDE: An IDE Plugin for Mapping Physiological Measurements to Source Code
[69] Enhancing Incremental Dataflow Analysis in an IDE
[70] How far are AI-powered programming assistants from meeting developers' needs?
[71] AntiCopyPaster: An Open-Source Ecosystem for Just-in-time Code Duplicates Extraction
Context-dependent evaluation with highlighted code and cursor position
The benchmark uniquely incorporates multiple contextual elements beyond the user instruction, including the full code file, highlighted code regions, and cursor position. This makes EditBench the first benchmark to evaluate instructed code edits with this combination of features.
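One way such context-dependent evaluation could work is to splice a model's proposed replacement into the highlighted span and compare the result against a reference file. This is only an illustrative sketch of the mechanism under that assumption, not the paper's actual scoring code.

```python
def apply_edit(file_content: str, span: tuple, replacement: str) -> str:
    """Replace the highlighted character span [start, end) with the model's edit."""
    start, end = span
    return file_content[:start] + replacement + file_content[end:]


def exact_match(file_content: str, span: tuple, replacement: str, reference: str) -> bool:
    """Score 1 if the edited file exactly matches the reference solution."""
    return apply_edit(file_content, span, replacement) == reference


# The user highlighted the parameter `n` (offsets 6-7) and asked for a rename;
# the model's proposed replacement is spliced into exactly that span.
src = "def f(n):\n    return n + 1\n"
edited = apply_edit(src, (6, 7), "count")
```

Because the edit is anchored to the selection, a model that rewrites unrelated parts of the file is penalized even if its output is plausible code, which is the point of evaluating with highlighted regions and cursor positions rather than instructions alone.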