Abstract:

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not difficult enough to meaningfully evaluate frontier models. To address this, we present Terminal-Bench 1.5: a carefully curated, hard benchmark of 74 tasks in computer terminal environments, inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score below 50% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Terminal-Bench 1.5, a benchmark of 74 real-world command-line tasks with unique environments, human-written solutions, and comprehensive verification tests. It sits within the Real-World Task Benchmarks leaf of the taxonomy, which contains only two papers total. This is a relatively sparse research direction compared to more crowded areas like General-Purpose Agent Development Platforms (3 papers) or Domain-Specific Agent Frameworks (5 papers), suggesting that authentic, long-horizon CLI task evaluation remains an underexplored niche despite growing interest in agent capabilities.

The benchmark occupies a position adjacent to several related directions. Its sibling paper in the same leaf addresses similar real-world task evaluation concerns, while nearby leaves include Safety and Risk Assessment Benchmarks (focused on security rather than task completion) and Domain-Specific Benchmarks (targeting specialized domains like cybersecurity or self-replication). The taxonomy's scope note for Real-World Task Benchmarks explicitly excludes synthetic or domain-specific work, positioning Terminal-Bench as part of a broader effort to ground agent evaluation in authentic workflows rather than controlled or narrow scenarios.

Among the 30 candidates examined through semantic search, none were found to clearly refute any of the three contributions: the evaluation framework, the dataset, or the Terminus 2 agent scaffold. Each contribution was assessed against 10 candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of 74 curated tasks, comprehensive verification tests, and the accompanying agent implementation appears relatively distinct. However, the analysis does not claim exhaustive coverage of all prior benchmarking work in CLI or software engineering domains.

Based on the limited literature search, the work appears to occupy a meaningful position in a sparse but growing research area. The taxonomy structure reveals that while agent frameworks and translation systems are well-populated, benchmarks emphasizing real-world task authenticity remain less common. The absence of clear refutations among examined candidates suggests novelty in the specific task curation and evaluation approach, though the scope of 30 candidates examined leaves open the possibility of relevant prior work beyond the search radius.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: benchmarking AI agents on command-line interface tasks. The field has evolved into a structured landscape with several major branches. Agent Frameworks and Development Platforms (e.g., OpenHands[1], OpenDevin[8]) provide the foundational infrastructure for building and deploying CLI-capable agents, while Benchmarking Methodologies and Evaluation Frameworks establish systematic ways to measure agent performance on real-world tasks. Natural Language to Command Translation addresses the core challenge of converting user intent into executable shell commands, a problem explored by works like Natural Language Bash[12] and CLI Command Generation[16].

Command-Line Data Analysis and Security focuses on understanding CLI usage patterns and detecting malicious behavior, while Application Domains and Use Cases demonstrates how these agents are applied in areas ranging from cloud infrastructure management to cybersecurity orchestration. Supporting Tools and Infrastructure provides the scaffolding needed for agent development, and Theoretical and Conceptual Foundations explores the underlying principles of human-computer interaction and agent design.

Within the Benchmarking Methodologies branch, a particularly active line of work centers on Real-World Task Benchmarks that evaluate agents on authentic, complex scenarios rather than synthetic exercises. Terminal-Bench[0] contributes to this direction by providing a benchmark grounded in actual command-line workflows, emphasizing ecological validity and practical task completion. This approach contrasts with earlier interactive coding environments like Intercode[2], which focused more narrowly on code-execution feedback loops, and complements recent efforts such as BountyBench[4], which targets software engineering tasks with real-world complexity.

The tension between controlled, reproducible evaluation and the messiness of genuine CLI usage remains a central challenge, as does the question of how to balance task diversity with meaningful performance metrics across different agent architectures and application contexts.

Claimed Contributions

Terminal-Bench evaluation framework

The authors introduce Terminal-Bench, a framework for evaluating AI agents on realistic tasks in command line interfaces. Each task consists of a containerized environment, instructions, verification tests, and reference solutions, enabling outcome-driven evaluation of agents performing high-skill technical work.

10 retrieved papers
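The outcome-driven evaluation described above can be sketched in a few lines. The names below (Task, score_run, benchmark_score) are invented for illustration and are not the benchmark's actual API; the sketch assumes the all-or-nothing scoring implied by "verification tests": a task counts as solved only if every test passes.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    instruction: str   # natural-language prompt given to the agent
    image: str         # container image defining the task environment
    test_results: list = field(default_factory=list)  # booleans from verification tests

def score_run(task: Task) -> bool:
    """A run counts as solved only when all verification tests pass."""
    return bool(task.test_results) and all(task.test_results)

def benchmark_score(tasks: list) -> float:
    """Fraction of tasks solved: the headline benchmark number."""
    return sum(score_run(t) for t in tasks) / len(tasks)
```

Binary, test-based scoring of this kind is what makes the evaluation outcome-driven: the agent's intermediate commands are not graded, only whether the final environment state satisfies the tests.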
Terminal-Bench 2.0 dataset

The authors present a curated dataset of 89 challenging tasks requiring extensive domain knowledge and long chains of interdependent actions. These tasks were crowd-sourced from 93 contributors and underwent rigorous verification including automated checks, manual reviews totaling approximately three hours per task, and adversarial testing.

10 retrieved papers
Terminus 2 agent scaffold

The authors develop Terminus 2, a minimal agent scaffold with a single headless terminal tool that completes tasks using only Bash commands. This provides a neutral baseline for evaluating model performance independent of agent-specific engineering optimizations.

10 retrieved papers
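A single-tool agent loop of the kind described above can be sketched as follows. This is not the actual Terminus 2 implementation; `model` is a stand-in callable (not a real LLM client), and the termination convention (returning None when done) is an assumption made for this sketch.

```python
import subprocess

def run_agent(model, max_steps: int = 20) -> str:
    """Minimal sketch of a headless-terminal agent loop.

    Each turn, the model sees the previous terminal output and replies
    with the next Bash command, or None to stop.
    """
    observation = ""  # terminal output shown to the model each turn
    for _ in range(max_steps):
        command = model(observation)
        if command is None:  # model signals it has finished
            break
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=60,
        )
        observation = proc.stdout + proc.stderr
    return observation
```

Keeping the scaffold to one tool and a flat loop is what makes it a neutral baseline: differences in scores then reflect the model's command-line competence rather than scaffold engineering.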

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Terminal-Bench evaluation framework

Contribution

Terminal-Bench 2.0 dataset

Contribution

Terminus 2 agent scaffold