MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Tool-using Agent; Real-World Tasks; Model Context Protocol
Abstract:

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input–output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage as well as trajectory-level planning and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MCP-Bench proposes a benchmark for evaluating LLM agents on realistic multi-step tasks requiring tool coordination, parameter control, and planning across 28 live MCP servers spanning 250 tools. It resides in the General-Purpose Agent Benchmarks leaf, which contains four papers including AgentBench and AgentBoard. This leaf sits within the broader Benchmark Design and Evaluation Frameworks branch, indicating a moderately populated research direction focused on holistic agent assessment rather than isolated competency testing.

The taxonomy reveals neighboring leaves addressing complementary evaluation challenges: Tool-Use and Function-Calling Benchmarks (five papers) focus on tool selection mechanics, Multi-Turn Interaction and Long-Horizon Benchmarks (four papers) examine sustained planning, and Domain-Specific Benchmarks (five papers) target application contexts. MCP-Bench bridges these areas by combining multi-step planning with cross-domain tool orchestration, diverging from narrower benchmarks like T-eval that isolate specific tool-use aspects. Its emphasis on fuzzy instructions and cross-tool workflows positions it at the intersection of general-purpose and tool-focused evaluation paradigms.

Among the 21 candidate papers examined, the core MCP-Bench benchmark contribution shows no clear refutation across its 10 candidates, suggesting novelty in its integration of live MCP servers and compositional task design. The structured task synthesis pipeline was compared against a single candidate, also without refutation. However, the multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring encountered one refutable candidate among the 10 examined, indicating prior work on hybrid evaluation methodologies. Because the search is limited to top-K semantic matches, these findings do not constitute exhaustive coverage.

Based on this constrained analysis, MCP-Bench appears to introduce a distinctive benchmark architecture emphasizing realistic tool ecosystems and fuzzy task specifications, though its evaluation methodology overlaps with existing hybrid scoring approaches. The taxonomy context suggests it occupies a moderately explored niche within general-purpose agent evaluation, with potential differentiation stemming from its MCP-based infrastructure and cross-domain workflow emphasis rather than fundamentally novel assessment principles.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating tool-using LLM agents on complex multi-step tasks. The field has organized itself around five major branches that reflect different facets of building and assessing capable agents. Benchmark Design and Evaluation Frameworks focuses on creating standardized testbeds and metrics to measure agent performance across diverse scenarios, with works like AgentBench[15] and AgentBoard[8] providing general-purpose evaluation suites. Agent Architectures and System Designs explores how to structure agents, whether as single monolithic systems or multi-agent collaborations, and how they interface with tools and environments. Agent Training and Optimization Methods investigates learning paradigms, from reinforcement learning approaches such as AgentGym-RL[12] to instruction tuning and memory-augmented strategies. Reasoning and Planning Strategies examines the cognitive mechanisms agents use to decompose tasks, select tools, and adapt plans dynamically. Finally, Safety, Reliability, and Risk Management addresses the critical need to ensure agents operate robustly and avoid harmful behaviors, as highlighted by SafeAgent[14].

Recent activity reveals a tension between breadth and depth in evaluation design. Many studies pursue general-purpose benchmarks that test agents across varied domains, while others drill into specific challenges like tool selection efficiency or long-horizon planning under uncertainty. MCP-Bench[0] situates itself within the General-Purpose Agent Benchmarks cluster, emphasizing multi-step task complexity and tool orchestration in a manner similar to AgentBoard[8] and AgentBench[15], yet it appears to place stronger emphasis on compositional reasoning across heterogeneous tool sets. Compared to more narrowly scoped benchmarks like T-eval[6] or ToolHaystack[4], which isolate particular competencies, MCP-Bench[0] adopts a holistic stance that mirrors the integrative philosophy seen in LLM to Autonomous Agents[3]. This positioning reflects an ongoing debate about whether richer, more realistic benchmarks or controlled, diagnostic evaluations better advance our understanding of agent capabilities and failure modes.

Claimed Contributions

MCP-Bench benchmark for realistic multi-step tool-using tasks

The authors present MCP-Bench, a benchmark that connects LLMs to 28 live MCP servers spanning 250 tools across multiple domains. It evaluates agents on complex, multi-hop tasks requiring tool coordination, parameter control, and planning under fuzzy instructions without explicit tool names or execution steps.

10 retrieved papers
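To make the claimed setup concrete, the following is a minimal, hypothetical Python sketch of what a single MCP-Bench-style task instance could look like: a fuzzy instruction paired with a reference tool chain spanning multiple servers. The class and field names (TaskInstance, fuzzy_instruction, tool_chain) and the example servers and tools are illustrative assumptions, not the paper's actual data format.

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """One step in a reference tool dependency chain (hypothetical schema)."""
    server: str                 # MCP server hosting the tool, e.g. "weather"
    tool: str                   # tool name exposed by that server
    depends_on: list[int] = field(default_factory=list)  # earlier steps whose outputs feed this call

@dataclass
class TaskInstance:
    """A benchmark task: a fuzzy instruction plus the reference tool chain used for grading."""
    fuzzy_instruction: str      # deliberately never names the tools the agent must discover
    tool_chain: list[ToolCall]

example = TaskInstance(
    fuzzy_instruction=("Plan a three-day trip to Tokyo in early April on a 2,000 USD budget, "
                       "and summarize the expected weather and total estimated cost."),
    tool_chain=[
        ToolCall(server="weather", tool="get_forecast"),
        ToolCall(server="travel",  tool="search_flights"),
        ToolCall(server="finance", tool="convert_currency", depends_on=[1]),
    ],
)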
Structured task synthesis pipeline with fuzzy instructions

The authors develop an automated pipeline that discovers dependency chains among tools, generates tasks based on these chains, applies quality filtering for solvability and utility, and produces fuzzy task variants that omit explicit operational details while preserving core objectives.

1 retrieved paper
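The pipeline description above suggests a straightforward skeleton. The Python sketch below shows one way such a pipeline could be organized, assuming each tool advertises named input parameters and output fields. The chaining heuristic (output names overlapping input names), the function names (discover_chains, fuzzify), and the omission of the quality-filtering stage are simplifications for illustration, not the authors' implementation.

from itertools import permutations

def discover_chains(tools: dict[str, dict], max_len: int = 3) -> list[list[str]]:
    """Find candidate dependency chains: tool A can precede tool B if some output
    field of A matches an input parameter name of B (a simple illustrative heuristic)."""
    def feeds(a: str, b: str) -> bool:
        return bool(set(tools[a]["outputs"]) & set(tools[b]["inputs"]))

    chains = []
    for length in range(2, max_len + 1):
        for combo in permutations(tools, length):
            if all(feeds(a, b) for a, b in zip(combo, combo[1:])):
                chains.append(list(combo))
    return chains

def fuzzify(explicit_task: str) -> str:
    """Placeholder for the rewriting step that turns an explicit task description into a
    fuzzy variant omitting tool names and step order (e.g. via an LLM prompt)."""
    return explicit_task  # the actual rewriting model is out of scope for this sketch

# Toy registry: the outputs of get_forecast feed the inputs of pack_list.
tools = {
    "get_forecast": {"inputs": ["city", "date"], "outputs": ["temperature", "conditions"]},
    "pack_list": {"inputs": ["temperature", "trip_length"], "outputs": ["items"]},
}
print(discover_chains(tools))  # [['get_forecast', 'pack_list']]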
Multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring

The authors propose an evaluation framework that uses rule-based metrics for tool validity, schema compliance, and runtime success, combined with rubric-driven LLM-as-a-Judge scoring for task completion, tool usage, and planning effectiveness, enhanced by prompt shuffling and score averaging for stability.

10 retrieved papers
Can Refute
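As a rough illustration of how such a hybrid scorer could be wired together, the Python sketch below combines rule-based checks (known tool name, JSON Schema compliance via the jsonschema library, runtime success) with an injected LLM judge whose rubric ordering is shuffled across trials and whose scores are averaged. The helper names, the prompt format, and the assumption that the judge returns a score in [0, 1] are illustrative, not the paper's exact protocol.

import random
import statistics

from jsonschema import ValidationError, validate  # third-party: pip install jsonschema

def rule_based_score(calls: list[dict], registry: dict[str, dict]) -> float:
    """Fraction of calls that name a known tool, satisfy its JSON Schema, and ran without error."""
    if not calls:
        return 0.0
    ok = 0
    for call in calls:
        if call["tool"] not in registry:
            continue                                  # invalid tool name
        try:
            validate(instance=call["arguments"], schema=registry[call["tool"]])
        except ValidationError:
            continue                                  # schema violation
        if call.get("runtime_error") is None:
            ok += 1                                   # executed successfully
    return ok / len(calls)

def judged_score(judge_llm, task: str, transcript: str, rubric_items: list[str], trials: int = 3) -> float:
    """Rubric-driven LLM-as-a-Judge score, averaged over shuffled rubric orderings for stability."""
    scores = []
    for _ in range(trials):
        shuffled = random.sample(rubric_items, k=len(rubric_items))  # new rubric order each trial
        prompt = ("Task:\n" + task + "\n\nAgent transcript:\n" + transcript +
                  "\n\nScore against this rubric (return a number in [0, 1]):\n" + "\n".join(shuffled))
        scores.append(float(judge_llm(prompt)))
    return statistics.mean(scores)

Shuffling the rubric across trials and averaging is one simple way to dampen position bias in judge prompts; the paper's stability mechanism may differ in its details.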

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MCP-Bench benchmark for realistic multi-step tool-using tasks

The authors present MCP-Bench, a benchmark that connects LLMs to 28 live MCP servers spanning 250 tools across multiple domains. It evaluates agents on complex, multi-hop tasks requiring tool coordination, parameter control, and planning under fuzzy instructions without explicit tool names or execution steps.

Contribution

Structured task synthesis pipeline with fuzzy instructions

The authors develop an automated pipeline that discovers dependency chains among tools, generates tasks based on these chains, applies quality filtering for solvability and utility, and produces fuzzy task variants that omit explicit operational details while preserving core objectives.

Contribution

Multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring

The authors propose an evaluation framework that uses rule-based metrics for tool validity, schema compliance, and runtime success, combined with rubric-driven LLM-as-a-Judge scoring for task completion, tool usage, and planning effectiveness, enhanced by prompt shuffling and score averaging for stability.