MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Tool-using Agent; Real-World Tasks; Model Context Protocol
Abstract:

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input–output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage as well as trajectory-level planning and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MCP-Bench proposes a benchmark for evaluating LLM agents on realistic multi-step tasks requiring tool coordination, parameter control, and planning across 28 live MCP servers spanning 250 tools. It resides in the General-Purpose Agent Benchmarks leaf, which contains four papers including AgentBench and AgentBoard. This leaf sits within the broader Benchmark Design and Evaluation Frameworks branch, indicating a moderately populated research direction focused on holistic agent assessment rather than isolated competency testing.

The taxonomy reveals neighboring leaves addressing complementary evaluation challenges: Tool-Use and Function-Calling Benchmarks (five papers) focus on tool selection mechanics, Multi-Turn Interaction and Long-Horizon Benchmarks (four papers) examine sustained planning, and Domain-Specific Benchmarks (five papers) target application contexts. MCP-Bench bridges these areas by combining multi-step planning with cross-domain tool orchestration, diverging from narrower benchmarks like T-eval that isolate specific tool-use aspects. Its emphasis on fuzzy instructions and cross-tool workflows positions it at the intersection of general-purpose and tool-focused evaluation paradigms.

Among the 21 candidate papers examined, the core MCP-Bench benchmark contribution shows no clear refutation across its 10 candidates, suggesting novelty in its integration of live MCP servers and compositional task design. The structured task synthesis pipeline was compared against a single candidate, also without refutation. However, the multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring encountered one refutable candidate among the 10 examined, indicating prior work on hybrid evaluation methodologies. Because the search is limited to top-K semantic matches, these findings do not constitute exhaustive coverage.

Based on this constrained analysis, MCP-Bench appears to introduce a distinctive benchmark architecture emphasizing realistic tool ecosystems and fuzzy task specifications, though its evaluation methodology overlaps with existing hybrid scoring approaches. The taxonomy context suggests it occupies a moderately explored niche within general-purpose agent evaluation, with potential differentiation stemming from its MCP-based infrastructure and cross-domain workflow emphasis rather than fundamentally novel assessment principles.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Paper: 1

Research Landscape Overview

Core task: evaluating tool-using LLM agents on complex multi-step tasks. The field has organized itself around five major branches that reflect different facets of building and assessing capable agents. Benchmark Design and Evaluation Frameworks focuses on creating standardized testbeds and metrics to measure agent performance across diverse scenarios, with works like AgentBench[15] and AgentBoard[8] providing general-purpose evaluation suites. Agent Architectures and System Designs explores how to structure agents, whether as single monolithic systems or multi-agent collaborations, and how they interface with tools and environments. Agent Training and Optimization Methods investigates learning paradigms, from reinforcement learning approaches such as AgentGym-RL[12] to instruction tuning and memory-augmented strategies. Reasoning and Planning Strategies examines the cognitive mechanisms agents use to decompose tasks, select tools, and adapt plans dynamically. Finally, Safety, Reliability, and Risk Management addresses the critical need to ensure agents operate robustly and avoid harmful behaviors, as highlighted by SafeAgent[14].

Recent activity reveals a tension between breadth and depth in evaluation design. Many studies pursue general-purpose benchmarks that test agents across varied domains, while others drill into specific challenges like tool selection efficiency or long-horizon planning under uncertainty. MCP-Bench[0] situates itself within the General-Purpose Agent Benchmarks cluster, emphasizing multi-step task complexity and tool orchestration in a manner similar to AgentBoard[8] and AgentBench[15], yet it appears to place stronger emphasis on compositional reasoning across heterogeneous tool sets. Compared to more narrowly scoped benchmarks like T-eval[6] or ToolHaystack[4], which isolate particular competencies, MCP-Bench[0] adopts a holistic stance that mirrors the integrative philosophy seen in LLM to Autonomous Agents[3]. This positioning reflects an ongoing debate about whether richer, more realistic benchmarks or controlled, diagnostic evaluations better advance our understanding of agent capabilities and failure modes.

Claimed Contributions

MCP-Bench benchmark for realistic multi-step tool-using tasks

The authors present MCP-Bench, a benchmark that connects LLMs to 28 live MCP servers spanning 250 tools across multiple domains. It evaluates agents on complex, multi-hop tasks requiring tool coordination, parameter control, and planning under fuzzy instructions without explicit tool names or execution steps.

10 retrieved papers
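To make the claimed setup concrete, the following is a minimal, hypothetical Python sketch of what a single MCP-Bench-style task instance could look like: a fuzzy instruction paired with a reference tool chain spanning multiple servers. The class and field names (TaskInstance, fuzzy_instruction, tool_chain) and the example servers and tools are illustrative assumptions, not the paper's actual data format.

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """One step in a reference tool dependency chain (hypothetical schema)."""
    server: str                 # MCP server hosting the tool, e.g. "weather"
    tool: str                   # tool name exposed by that server
    depends_on: list[int] = field(default_factory=list)  # earlier steps whose outputs feed this call

@dataclass
class TaskInstance:
    """A benchmark task: a fuzzy instruction plus the reference tool chain used for grading."""
    fuzzy_instruction: str      # deliberately never names the tools the agent must discover
    tool_chain: list[ToolCall]

example = TaskInstance(
    fuzzy_instruction=("Plan a three-day trip to Tokyo in early April on a 2,000 USD budget, "
                       "and summarize the expected weather and total estimated cost."),
    tool_chain=[
        ToolCall(server="weather", tool="get_forecast"),
        ToolCall(server="travel",  tool="search_flights"),
        ToolCall(server="finance", tool="convert_currency", depends_on=[1]),
    ],
)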
Structured task synthesis pipeline with fuzzy instructions

The authors develop an automated pipeline that discovers dependency chains among tools, generates tasks based on these chains, applies quality filtering for solvability and utility, and produces fuzzy task variants that omit explicit operational details while preserving core objectives.

1 retrieved paper
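The pipeline description above suggests a straightforward skeleton. The Python sketch below shows one way such a pipeline could be organized, assuming each tool advertises named input parameters and output fields. The chaining heuristic (output names overlapping input names), the function names (discover_chains, fuzzify), and the omission of the quality-filtering stage are simplifications for illustration, not the authors' implementation.

from itertools import permutations

def discover_chains(tools: dict[str, dict], max_len: int = 3) -> list[list[str]]:
    """Find candidate dependency chains: tool A can precede tool B if some output
    field of A matches an input parameter name of B (a simple illustrative heuristic)."""
    def feeds(a: str, b: str) -> bool:
        return bool(set(tools[a]["outputs"]) & set(tools[b]["inputs"]))

    chains = []
    for length in range(2, max_len + 1):
        for combo in permutations(tools, length):
            if all(feeds(a, b) for a, b in zip(combo, combo[1:])):
                chains.append(list(combo))
    return chains

def fuzzify(explicit_task: str) -> str:
    """Placeholder for the rewriting step that turns an explicit task description into a
    fuzzy variant omitting tool names and step order (e.g. via an LLM prompt)."""
    return explicit_task  # the actual rewriting model is out of scope for this sketch

# Toy registry: the outputs of get_forecast feed the inputs of pack_list.
tools = {
    "get_forecast": {"inputs": ["city", "date"], "outputs": ["temperature", "conditions"]},
    "pack_list": {"inputs": ["temperature", "trip_length"], "outputs": ["items"]},
}
print(discover_chains(tools))  # [['get_forecast', 'pack_list']]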
Multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring

The authors propose an evaluation framework that uses rule-based metrics for tool validity, schema compliance, and runtime success, combined with rubric-driven LLM-as-a-Judge scoring for task completion, tool usage, and planning effectiveness, enhanced by prompt shuffling and score averaging for stability.

10 retrieved papers
Can Refute
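As a rough illustration of how such a hybrid scorer could be wired together, the Python sketch below combines rule-based checks (known tool name, JSON Schema compliance via the jsonschema library, runtime success) with an injected LLM judge whose rubric ordering is shuffled across trials and whose scores are averaged. The helper names, the prompt format, and the assumption that the judge returns a score in [0, 1] are illustrative, not the paper's exact protocol.

import random
import statistics

from jsonschema import ValidationError, validate  # third-party: pip install jsonschema

def rule_based_score(calls: list[dict], registry: dict[str, dict]) -> float:
    """Fraction of calls that name a known tool, satisfy its JSON Schema, and ran without error."""
    if not calls:
        return 0.0
    ok = 0
    for call in calls:
        if call["tool"] not in registry:
            continue                                  # invalid tool name
        try:
            validate(instance=call["arguments"], schema=registry[call["tool"]])
        except ValidationError:
            continue                                  # schema violation
        if call.get("runtime_error") is None:
            ok += 1                                   # executed successfully
    return ok / len(calls)

def judged_score(judge_llm, task: str, transcript: str, rubric_items: list[str], trials: int = 3) -> float:
    """Rubric-driven LLM-as-a-Judge score, averaged over shuffled rubric orderings for stability."""
    scores = []
    for _ in range(trials):
        shuffled = random.sample(rubric_items, k=len(rubric_items))  # new rubric order each trial
        prompt = ("Task:\n" + task + "\n\nAgent transcript:\n" + transcript +
                  "\n\nScore against this rubric (return a number in [0, 1]):\n" + "\n".join(shuffled))
        scores.append(float(judge_llm(prompt)))
    return statistics.mean(scores)

Shuffling the rubric across trials and averaging is one simple way to dampen position bias in judge prompts; the paper's stability mechanism may differ in its details.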

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MCP-Bench benchmark for realistic multi-step tool-using tasks

The authors present MCP-Bench, a benchmark that connects LLMs to 28 live MCP servers spanning 250 tools across multiple domains. It evaluates agents on complex, multi-hop tasks requiring tool coordination, parameter control, and planning under fuzzy instructions without explicit tool names or execution steps.

Contribution

Structured task synthesis pipeline with fuzzy instructions

The authors develop an automated pipeline that discovers dependency chains among tools, generates tasks based on these chains, applies quality filtering for solvability and utility, and produces fuzzy task variants that omit explicit operational details while preserving core objectives.

Contribution

Multi-faceted evaluation framework combining rule-based and LLM-as-a-Judge scoring

The authors propose an evaluation framework that uses rule-based metrics for tool validity, schema compliance, and runtime success, combined with rubric-driven LLM-as-a-Judge scoring for task completion, tool usage, and planning effectiveness, enhanced by prompt shuffling and score averaging for stability.