LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Model Context Protocol, MCP-use, Benchmark
Abstract:

Model Context Protocol (MCP) has become a key piece of infrastructure for connecting LLMs with external tools, scaling to 10,000+ MCP servers with diverse tools. Unfortunately, a large gap remains between real-world MCP usage and current evaluation: existing benchmarks typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieval and multi-tool composition. To bridge this gap, we propose LiveMCPBench, which evaluates 95 real-world daily tasks explicitly constructed to stress diverse tools and scaled multi-server routing. The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration. We further introduce an LLM-as-a-Judge evaluation framework that directly verifies task outcomes, handling dynamic data sources and multiple valid solution paths. We benchmark 10 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30–50%. Our analysis reveals that active tool composition strongly correlates with task success, whereas retrieval errors account for nearly half of all failures, highlighting retrieval as the dominant bottleneck. Together, these results provide the first large-scale, reproducible diagnosis of MCP agent capabilities and point towards future research on improving retrieval robustness and encouraging effective tool composition. Code and data will be released upon publication.
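The retrieval-versus-injection distinction the abstract draws can be illustrated with a minimal sketch: instead of injecting all 527 tool schemas into the model's context, an agent scores tool descriptions against the query and surfaces only the top matches. The toy bag-of-words similarity and tool names below are assumptions for illustration, not the benchmark's actual retriever, which would typically use a dense text embedder.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; a real system would use a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query, tool_descriptions, k=3):
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    scored = sorted(tool_descriptions.items(),
                    key=lambda kv: cosine(q, embed(kv[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical tool registry standing in for a large MCP tool pool.
tools = {
    "weather.get_forecast": "get the weather forecast for a city",
    "calendar.create_event": "create a calendar event at a given time",
    "flights.search": "search for flights between two airports",
}
print(retrieve_tools("what is the weather in Paris", tools, k=1))
# -> ['weather.get_forecast']
```

Only the retrieved subset would then be injected into the model's context, which is exactly the step the paper argues prior single-server evaluations skip.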

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LiveMCPBench introduces a benchmark for evaluating LLM agents on 95 real-world daily tasks requiring large-scale tool retrieval and multi-server composition, accompanied by a reproducible suite of 70 servers and 527 tools. The paper sits within the Real-World Task Benchmarks leaf of the taxonomy, which contains six sibling papers addressing authentic multi-step evaluation scenarios. This leaf represents a moderately populated research direction within the broader Benchmarking and Evaluation Frameworks branch, indicating active but not overcrowded interest in ecologically valid MCP assessment.

The taxonomy reveals neighboring evaluation approaches: Security and Robustness Evaluation focuses on adversarial protocol violations, Capability Probing examines interaction dimensions like proactivity and compliance, and Automated Evaluation Pipelines address task generation at scale. LiveMCPBench diverges from these by emphasizing live-system dynamics and multi-server routing rather than security threats or automated synthesis. The scope note for Real-World Task Benchmarks explicitly includes 'live MCP server interaction,' positioning this work centrally within its designated category while excluding the security and training-data concerns addressed by sibling branches.

Among 30 candidates examined across three contributions, none yielded clear refutations. The LiveMCPBench benchmark contribution examined 10 candidates with zero refutable overlaps, as did the LiveMCPTool reproducible toolset and LiveMCPEval framework contributions. The empirical diagnosis contribution similarly found no prior work among 10 candidates that directly anticipated its analysis of tool composition patterns and retrieval error correlations. This limited search scope suggests the specific combination of large-scale routing, reproducible deployment infrastructure, and LLM-as-a-Judge outcome verification may represent a relatively unexplored configuration within the real-world benchmarking space.

The analysis reflects a targeted semantic search rather than exhaustive coverage, examining 30 candidates drawn from top-K matches and citation expansion. While the taxonomy shows six sibling papers in the same leaf, the contribution-level statistics indicate that within the examined sample, no single prior work directly subsumes the proposed benchmark's emphasis on scaled multi-server environments and dynamic outcome verification. The findings suggest incremental novelty in integrating these elements, though broader literature beyond the search scope may contain relevant precursors.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Large-scale tool retrieval and composition in MCP ecosystems. The field organizes around five main branches that together address how agents discover, select, and orchestrate tools within the Model Context Protocol framework. Tool Retrieval and Selection Methods focus on algorithms for matching agent needs to available capabilities, often leveraging graph-based representations or dynamic indexing strategies such as those in Dynamic Graph Tool Retrieval[1] and Tool to Agent Retrieval[3]. Multi-Agent Coordination and Protocol Integration examines how multiple agents communicate and share resources through standardized protocols, with works like Agent Interoperability Survey[2] and MCP Survey[5] exploring interoperability challenges. MCP Server Development and Tool Integration covers the engineering of server infrastructure and API wrappers that expose diverse functionalities, exemplified by domain-specific servers such as EnergyPlus MCP[19] and RAG MCP[18]. Benchmarking and Evaluation Frameworks provide systematic testbeds for measuring retrieval accuracy, composition correctness, and end-to-end task success, while Domain-Specific Applications demonstrate real-world deployments in areas ranging from urban logistics to bioinformatics.

Within the benchmarking landscape, a growing cluster of works targets real-world task scenarios that stress-test retrieval and composition under realistic constraints. MCP Bench[7] and MCP AgentBench[15] offer controlled environments for protocol-level evaluation, whereas LiveMCPBench[0] emphasizes dynamic, live-system benchmarks that capture the complexity of evolving tool ecosystems and concurrent agent interactions. Compared to more static or synthetic benchmarks like MSC Bench[44], LiveMCPBench[0] prioritizes ecological validity by incorporating real-time server updates and multi-step composition challenges.
This positions it alongside efforts such as MCP Universe[42] and MCP Atlas[47], which also aim to map the breadth of available tools, but LiveMCPBench[0] distinguishes itself by focusing on continuous evaluation rather than cataloging alone. These contrasts highlight an open question in the field: how to balance reproducibility with the need to assess agents in fluid, production-like settings where tool availability and API semantics shift over time.

Claimed Contributions

LiveMCPBench benchmark for large-scale MCP evaluation

The authors present LiveMCPBench, a new benchmark containing 95 real-world daily tasks across six domains (Office, Lifestyle, Leisure, Finance, Travel, Shopping) designed to evaluate agents' capabilities in retrieving and composing tools from large-scale Model Context Protocol (MCP) ecosystems, addressing limitations of prior benchmarks that assume single-server settings.

10 retrieved papers
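A minimal sketch of what a task record in such a benchmark might look like, assuming a hypothetical schema: the six domain names come from the description above, but the field names and data format are illustrative assumptions, not the paper's actual release format.

```python
from dataclasses import dataclass, field

# Six domains named in the benchmark description.
DOMAINS = {"Office", "Lifestyle", "Leisure", "Finance", "Travel", "Shopping"}

@dataclass
class Task:
    task_id: str
    domain: str
    instruction: str
    # One annotated reference path; agents may succeed via other tool sequences,
    # which is why outcome-based judging (rather than path matching) is needed.
    reference_tools: list = field(default_factory=list)

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")

# Hypothetical example task in the Travel domain.
example = Task(
    task_id="travel-001",
    domain="Travel",
    instruction="Find the cheapest round-trip flight from Tokyo to Seoul next weekend.",
    reference_tools=["flights.search", "flights.compare_prices"],
)
```

Validating the domain at construction time keeps the 95-task corpus consistent with the six-domain split described above.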
LiveMCPTool: reproducible toolset and LiveMCPEval evaluation framework

The authors contribute LiveMCPTool, a curated collection of 70 MCP servers with 527 tools that is ready-to-deploy without scattered API configuration, and LiveMCPEval, an LLM-as-a-Judge evaluation framework that automatically verifies task outcomes in dynamic settings while supporting multiple valid solution paths.

10 retrieved papers
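An LLM-as-a-Judge outcome check of the kind described can be sketched as below. The prompt wording, the `call_llm` interface, and the stub model are hypothetical illustrations of the general pattern, not LiveMCPEval's actual implementation.

```python
JUDGE_TEMPLATE = """You are evaluating whether an agent completed a task.
Task: {task}
Agent trajectory (tool calls and results):
{trajectory}
Final answer: {answer}

Any valid solution path counts. Reply with exactly SUCCESS or FAILURE
on the first line, then a one-line justification."""

def judge_outcome(task, trajectory, answer, call_llm):
    """Ask an LLM judge to verify the task outcome.

    call_llm is any text-to-text model callable; judging the outcome rather
    than the tool sequence is what allows multiple valid solution paths.
    """
    prompt = JUDGE_TEMPLATE.format(
        task=task,
        trajectory="\n".join(trajectory),
        answer=answer,
    )
    verdict = call_llm(prompt).strip().splitlines()[0].strip().upper()
    return verdict.startswith("SUCCESS")

# Stub model for illustration; a real deployment would call an LLM API here.
stub = lambda prompt: "SUCCESS\nThe agent retrieved and reported the forecast."
ok = judge_outcome(
    task="Report tomorrow's weather in Paris.",
    trajectory=["weather.get_forecast(city='Paris') -> 'sunny, 21C'"],
    answer="Tomorrow in Paris: sunny, 21C.",
    call_llm=stub,
)
print(ok)  # True with the stub judge
```

Because the judge sees the live trajectory and final answer rather than a fixed gold path, the same check works even when the underlying data sources change between runs.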
Empirical diagnosis of MCP agent capabilities

The authors conduct a comprehensive empirical study benchmarking 10 state-of-the-art LLMs, revealing that tool retrieval errors account for nearly half of all failures and that active tool composition strongly correlates with task success, thereby identifying key bottlenecks and providing insights for future MCP agent research.

10 retrieved papers
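The two headline findings, retrieval errors accounting for nearly half of failures and tool composition correlating with success, can be computed from run logs roughly as follows. The run-record fields (`success`, `error`, `tools_used`) and the sample data are assumptions for illustration, not the paper's actual logs.

```python
from collections import Counter

def failure_breakdown(runs):
    """Tally failure causes and compare tool usage between successes and failures.

    Each run is a dict with 'success' (bool), 'error' (category or None),
    and 'tools_used' (number of distinct tools the agent invoked).
    """
    failures = [r for r in runs if not r["success"]]
    causes = Counter(r["error"] for r in failures)
    # Share of failures attributable to tool retrieval.
    retrieval_share = causes["retrieval"] / len(failures) if failures else 0.0
    # Average number of tools composed, split by outcome.
    avg_tools_success = (
        sum(r["tools_used"] for r in runs if r["success"])
        / max(1, sum(r["success"] for r in runs))
    )
    avg_tools_failure = (
        sum(r["tools_used"] for r in failures) / max(1, len(failures))
    )
    return retrieval_share, avg_tools_success, avg_tools_failure

# Hypothetical run logs for illustration.
runs = [
    {"success": True,  "error": None,        "tools_used": 3},
    {"success": True,  "error": None,        "tools_used": 2},
    {"success": False, "error": "retrieval", "tools_used": 1},
    {"success": False, "error": "execution", "tools_used": 1},
]
share, avg_s, avg_f = failure_breakdown(runs)
print(share, avg_s, avg_f)  # 0.5 2.5 1.0
```

On this toy log, half the failures are retrieval errors and successful runs compose more tools on average, mirroring the shape (though not the numbers) of the paper's diagnosis.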

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
