Abstract:

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not difficult enough to meaningfully evaluate frontier models. To address this, we present Terminal-Bench 1.5: a carefully curated, hard benchmark of 74 tasks in computer terminal environments, inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score below 50% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Terminal-Bench 1.5, a benchmark of 74 real-world command-line tasks with unique environments, human-written solutions, and comprehensive verification tests. It sits within the Real-World Task Benchmarks leaf of the taxonomy, which contains only two papers total. This is a relatively sparse research direction compared to more crowded areas like General-Purpose Agent Development Platforms (3 papers) or Domain-Specific Agent Frameworks (5 papers), suggesting that authentic, long-horizon CLI task evaluation remains an underexplored niche despite growing interest in agent capabilities.

The benchmark occupies a position adjacent to several related directions. Its sibling paper in the same leaf addresses similar real-world task evaluation concerns, while nearby leaves include Safety and Risk Assessment Benchmarks (focused on security rather than task completion) and Domain-Specific Benchmarks (targeting specialized domains like cybersecurity or self-replication). The taxonomy's scope note for Real-World Task Benchmarks explicitly excludes synthetic or domain-specific work, positioning Terminal-Bench as part of a broader effort to ground agent evaluation in authentic workflows rather than controlled or narrow scenarios.

Among the 30 candidates examined through semantic search, none were found to clearly refute any of the three contributions: the evaluation framework, the dataset, or the Terminus 2 agent scaffold. Each contribution was assessed against 10 candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of 74 curated tasks, comprehensive verification tests, and the accompanying agent implementation appears relatively distinct. However, the analysis does not claim exhaustive coverage of all prior benchmarking work in CLI or software engineering domains.

Based on the limited literature search, the work appears to occupy a meaningful position in a sparse but growing research area. The taxonomy structure reveals that while agent frameworks and translation systems are well-populated, benchmarks emphasizing real-world task authenticity remain less common. The absence of clear refutations among examined candidates suggests novelty in the specific task curation and evaluation approach, though the scope of 30 candidates examined leaves open the possibility of relevant prior work beyond the search radius.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: benchmarking AI agents on command-line interface tasks. The field has evolved into a structured landscape with several major branches. Agent Frameworks and Development Platforms (e.g., OpenHands[1], OpenDevin[8]) provide the foundational infrastructure for building and deploying CLI-capable agents, while Benchmarking Methodologies and Evaluation Frameworks establish systematic ways to measure agent performance on real-world tasks. Natural Language to Command Translation addresses the core challenge of converting user intent into executable shell commands, a problem explored by works like Natural Language Bash[12] and CLI Command Generation[16].

Command-Line Data Analysis and Security focuses on understanding CLI usage patterns and detecting malicious behavior, while Application Domains and Use Cases demonstrates how these agents are applied in areas ranging from cloud infrastructure management to cybersecurity orchestration. Supporting Tools and Infrastructure provides the scaffolding needed for agent development, and Theoretical and Conceptual Foundations explores the underlying principles of human-computer interaction and agent design.

Within the Benchmarking Methodologies branch, a particularly active line of work centers on Real-World Task Benchmarks that evaluate agents on authentic, complex scenarios rather than synthetic exercises. Terminal-Bench[0] contributes to this direction by providing a benchmark grounded in actual command-line workflows, emphasizing ecological validity and practical task completion. This approach contrasts with earlier interactive coding environments like Intercode[2], which focused more narrowly on code-execution feedback loops, and complements recent efforts such as BountyBench[4], which targets software engineering tasks with real-world complexity.

The tension between controlled, reproducible evaluation and the messiness of genuine CLI usage remains a central challenge, as does the question of how to balance task diversity with meaningful performance metrics across different agent architectures and application contexts.

Claimed Contributions

Terminal-Bench evaluation framework

The authors introduce Terminal-Bench, a framework for evaluating AI agents on realistic tasks in command line interfaces. Each task consists of a containerized environment, instructions, verification tests, and reference solutions, enabling outcome-driven evaluation of agents performing high-skill technical work.

10 retrieved papers
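The outcome-driven evaluation described above can be sketched in a few lines. The names below (Task, score_run, benchmark_score) are invented for illustration and are not the benchmark's actual API; the sketch assumes the all-or-nothing scoring implied by "verification tests": a task counts as solved only if every test passes.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    instruction: str   # natural-language prompt given to the agent
    image: str         # container image defining the task environment
    test_results: list = field(default_factory=list)  # booleans from verification tests

def score_run(task: Task) -> bool:
    """A run counts as solved only when all verification tests pass."""
    return bool(task.test_results) and all(task.test_results)

def benchmark_score(tasks: list) -> float:
    """Fraction of tasks solved: the headline benchmark number."""
    return sum(score_run(t) for t in tasks) / len(tasks)
```

Binary, test-based scoring of this kind is what makes the evaluation outcome-driven: the agent's intermediate commands are not graded, only whether the final environment state satisfies the tests.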
Terminal-Bench 2.0 dataset

The authors present a curated dataset of 89 challenging tasks requiring extensive domain knowledge and long chains of interdependent actions. These tasks were crowd-sourced from 93 contributors and underwent rigorous verification including automated checks, manual reviews totaling approximately three hours per task, and adversarial testing.

10 retrieved papers
Terminus 2 agent scaffold

The authors develop Terminus 2, a minimal agent scaffold with a single headless terminal tool that completes tasks using only Bash commands. This provides a neutral baseline for evaluating model performance independent of agent-specific engineering optimizations.

10 retrieved papers
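A single-tool agent loop of the kind described above can be sketched as follows. This is not the actual Terminus 2 implementation; `model` is a stand-in callable (not a real LLM client), and the termination convention (returning None when done) is an assumption made for this sketch.

```python
import subprocess

def run_agent(model, max_steps: int = 20) -> str:
    """Minimal sketch of a headless-terminal agent loop.

    Each turn, the model sees the previous terminal output and replies
    with the next Bash command, or None to stop.
    """
    observation = ""  # terminal output shown to the model each turn
    for _ in range(max_steps):
        command = model(observation)
        if command is None:  # model signals it has finished
            break
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=60,
        )
        observation = proc.stdout + proc.stderr
    return observation
```

Keeping the scaffold to one tool and a flat loop is what makes it a neutral baseline: differences in scores then reflect the model's command-line competence rather than scaffold engineering.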

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Terminal-Bench evaluation framework

Contribution

Terminal-Bench 2.0 dataset

Contribution

Terminus 2 agent scaffold