LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Model Context Protocol, MCP-use, Benchmark
Abstract:

Model Context Protocol (MCP) has become a key piece of infrastructure for connecting LLMs with external tools, scaling to 10,000+ MCP servers with diverse tools. Unfortunately, a large gap remains between real-world MCP usage and current evaluation: existing benchmarks typically assume single-server settings and directly inject tools into the model's context, bypassing the challenges of large-scale retrieval and multi-tool composition. To bridge this gap, we propose LiveMCPBench, which evaluates 95 real-world daily tasks explicitly constructed to stress diverse tools and scaled multi-server routing. The benchmark includes a ready-to-deploy tool suite of 70 servers with 527 tools, ensuring reproducibility without scattered API configuration. We further introduce an LLM-as-a-Judge evaluation framework that directly verifies task outcomes, handling dynamic data sources and multiple valid solution paths. We benchmark 10 state-of-the-art LLMs and observe a substantial performance gap: while Claude-Sonnet-4 reaches 78.95% task success, most models achieve only 30–50%. Our analysis reveals that active tool composition strongly correlates with task success, whereas retrieval errors account for nearly half of all failures, highlighting retrieval as the dominant bottleneck. Together, these results provide the first large-scale, reproducible diagnosis of MCP agent capabilities and point towards future research on improving retrieval robustness and encouraging effective tool composition. Code and data will be released upon publication.
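The retrieval-versus-injection distinction the abstract draws can be illustrated with a minimal sketch: instead of injecting all 527 tool schemas into the model's context, an agent scores tool descriptions against the query and surfaces only the top matches. The toy bag-of-words similarity and tool names below are assumptions for illustration, not the benchmark's actual retriever, which would typically use a dense text embedder.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; a real system would use a dense embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query, tool_descriptions, k=3):
    """Return the k tool names whose descriptions best match the query."""
    q = embed(query)
    scored = sorted(tool_descriptions.items(),
                    key=lambda kv: cosine(q, embed(kv[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical tool registry standing in for a large MCP tool pool.
tools = {
    "weather.get_forecast": "get the weather forecast for a city",
    "calendar.create_event": "create a calendar event at a given time",
    "flights.search": "search for flights between two airports",
}
print(retrieve_tools("what is the weather in Paris", tools, k=1))
# -> ['weather.get_forecast']
```

Only the retrieved subset would then be injected into the model's context, which is exactly the step the paper argues prior single-server evaluations skip.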

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LiveMCPBench introduces a benchmark for evaluating LLM agents on 95 real-world daily tasks requiring large-scale tool retrieval and multi-server composition, accompanied by a reproducible suite of 70 servers and 527 tools. The paper sits within the Real-World Task Benchmarks leaf of the taxonomy, which contains six sibling papers addressing authentic multi-step evaluation scenarios. This leaf represents a moderately populated research direction within the broader Benchmarking and Evaluation Frameworks branch, indicating active but not overcrowded interest in ecologically valid MCP assessment.

The taxonomy reveals neighboring evaluation approaches: Security and Robustness Evaluation focuses on adversarial protocol violations, Capability Probing examines interaction dimensions like proactivity and compliance, and Automated Evaluation Pipelines address task generation at scale. LiveMCPBench diverges from these by emphasizing live-system dynamics and multi-server routing rather than security threats or automated synthesis. The scope note for Real-World Task Benchmarks explicitly includes 'live MCP server interaction,' positioning this work centrally within its designated category while excluding the security and training-data concerns addressed by sibling branches.

Among 30 candidates examined across three contributions, none yielded clear refutations. The LiveMCPBench benchmark contribution examined 10 candidates with zero refutable overlaps, as did the LiveMCPTool reproducible toolset and LiveMCPEval framework contributions. The empirical diagnosis contribution similarly found no prior work among 10 candidates that directly anticipated its analysis of tool composition patterns and retrieval error correlations. This limited search scope suggests the specific combination of large-scale routing, reproducible deployment infrastructure, and LLM-as-a-Judge outcome verification may represent a relatively unexplored configuration within the real-world benchmarking space.

The analysis reflects a targeted semantic search rather than exhaustive coverage, examining 30 candidates drawn from top-K matches and citation expansion. While the taxonomy shows six sibling papers in the same leaf, the contribution-level statistics indicate that within the examined sample, no single prior work directly subsumes the proposed benchmark's emphasis on scaled multi-server environments and dynamic outcome verification. The findings suggest incremental novelty in integrating these elements, though broader literature beyond the search scope may contain relevant precursors.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Large-scale tool retrieval and composition in MCP ecosystems. The field organizes around five main branches that together address how agents discover, select, and orchestrate tools within the Model Context Protocol framework. Tool Retrieval and Selection Methods focus on algorithms for matching agent needs to available capabilities, often leveraging graph-based representations or dynamic indexing strategies such as those in Dynamic Graph Tool Retrieval[1] and Tool to Agent Retrieval[3]. Multi-Agent Coordination and Protocol Integration examines how multiple agents communicate and share resources through standardized protocols, with works like Agent Interoperability Survey[2] and MCP Survey[5] exploring interoperability challenges. MCP Server Development and Tool Integration covers the engineering of server infrastructure and API wrappers that expose diverse functionalities, exemplified by domain-specific servers such as EnergyPlus MCP[19] and RAG MCP[18]. Benchmarking and Evaluation Frameworks provide systematic testbeds for measuring retrieval accuracy, composition correctness, and end-to-end task success, while Domain-Specific Applications demonstrate real-world deployments in areas ranging from urban logistics to bioinformatics.

Within the benchmarking landscape, a growing cluster of works targets real-world task scenarios that stress-test retrieval and composition under realistic constraints. MCP Bench[7] and MCP AgentBench[15] offer controlled environments for protocol-level evaluation, whereas LiveMCPBench[0] emphasizes dynamic, live-system benchmarks that capture the complexity of evolving tool ecosystems and concurrent agent interactions. Compared to more static or synthetic benchmarks like MSC Bench[44], LiveMCPBench[0] prioritizes ecological validity by incorporating real-time server updates and multi-step composition challenges.
This positions it alongside efforts such as MCP Universe[42] and MCP Atlas[47], which also aim to map the breadth of available tools, but LiveMCPBench[0] distinguishes itself by focusing on continuous evaluation rather than cataloging alone. These contrasts highlight an open question in the field: how to balance reproducibility with the need to assess agents in fluid, production-like settings where tool availability and API semantics shift over time.

Claimed Contributions

LiveMCPBench benchmark for large-scale MCP evaluation

The authors present LiveMCPBench, a new benchmark containing 95 real-world daily tasks across six domains (Office, Lifestyle, Leisure, Finance, Travel, Shopping) designed to evaluate agents' capabilities in retrieving and composing tools from large-scale Model Context Protocol (MCP) ecosystems, addressing limitations of prior benchmarks that assume single-server settings.

10 retrieved papers
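A minimal sketch of what a task record in such a benchmark might look like, assuming a hypothetical schema: the six domain names come from the description above, but the field names and data format are illustrative assumptions, not the paper's actual release format.

```python
from dataclasses import dataclass, field

# Six domains named in the benchmark description.
DOMAINS = {"Office", "Lifestyle", "Leisure", "Finance", "Travel", "Shopping"}

@dataclass
class Task:
    task_id: str
    domain: str
    instruction: str
    # One annotated reference path; agents may succeed via other tool sequences,
    # which is why outcome-based judging (rather than path matching) is needed.
    reference_tools: list = field(default_factory=list)

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")

# Hypothetical example task in the Travel domain.
example = Task(
    task_id="travel-001",
    domain="Travel",
    instruction="Find the cheapest round-trip flight from Tokyo to Seoul next weekend.",
    reference_tools=["flights.search", "flights.compare_prices"],
)
```

Validating the domain at construction time keeps the 95-task corpus consistent with the six-domain split described above.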
LiveMCPTool: reproducible toolset and LiveMCPEval evaluation framework

The authors contribute LiveMCPTool, a curated collection of 70 MCP servers with 527 tools that is ready-to-deploy without scattered API configuration, and LiveMCPEval, an LLM-as-a-Judge evaluation framework that automatically verifies task outcomes in dynamic settings while supporting multiple valid solution paths.

10 retrieved papers
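An LLM-as-a-Judge outcome check of the kind described can be sketched as below. The prompt wording, the `call_llm` interface, and the stub model are hypothetical illustrations of the general pattern, not LiveMCPEval's actual implementation.

```python
JUDGE_TEMPLATE = """You are evaluating whether an agent completed a task.
Task: {task}
Agent trajectory (tool calls and results):
{trajectory}
Final answer: {answer}

Any valid solution path counts. Reply with exactly SUCCESS or FAILURE
on the first line, then a one-line justification."""

def judge_outcome(task, trajectory, answer, call_llm):
    """Ask an LLM judge to verify the task outcome.

    call_llm is any text-to-text model callable; judging the outcome rather
    than the tool sequence is what allows multiple valid solution paths.
    """
    prompt = JUDGE_TEMPLATE.format(
        task=task,
        trajectory="\n".join(trajectory),
        answer=answer,
    )
    verdict = call_llm(prompt).strip().splitlines()[0].strip().upper()
    return verdict.startswith("SUCCESS")

# Stub model for illustration; a real deployment would call an LLM API here.
stub = lambda prompt: "SUCCESS\nThe agent retrieved and reported the forecast."
ok = judge_outcome(
    task="Report tomorrow's weather in Paris.",
    trajectory=["weather.get_forecast(city='Paris') -> 'sunny, 21C'"],
    answer="Tomorrow in Paris: sunny, 21C.",
    call_llm=stub,
)
print(ok)  # True with the stub judge
```

Because the judge sees the live trajectory and final answer rather than a fixed gold path, the same check works even when the underlying data sources change between runs.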
Empirical diagnosis of MCP agent capabilities

The authors conduct a comprehensive empirical study benchmarking 10 state-of-the-art LLMs, revealing that tool retrieval errors account for nearly half of all failures and that active tool composition strongly correlates with task success, thereby identifying key bottlenecks and providing insights for future MCP agent research.

10 retrieved papers
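The two headline findings, retrieval errors accounting for nearly half of failures and tool composition correlating with success, can be computed from run logs roughly as follows. The run-record fields (`success`, `error`, `tools_used`) and the sample data are assumptions for illustration, not the paper's actual logs.

```python
from collections import Counter

def failure_breakdown(runs):
    """Tally failure causes and compare tool usage between successes and failures.

    Each run is a dict with 'success' (bool), 'error' (category or None),
    and 'tools_used' (number of distinct tools the agent invoked).
    """
    failures = [r for r in runs if not r["success"]]
    causes = Counter(r["error"] for r in failures)
    # Share of failures attributable to tool retrieval.
    retrieval_share = causes["retrieval"] / len(failures) if failures else 0.0
    # Average number of tools composed, split by outcome.
    avg_tools_success = (
        sum(r["tools_used"] for r in runs if r["success"])
        / max(1, sum(r["success"] for r in runs))
    )
    avg_tools_failure = (
        sum(r["tools_used"] for r in failures) / max(1, len(failures))
    )
    return retrieval_share, avg_tools_success, avg_tools_failure

# Hypothetical run logs for illustration.
runs = [
    {"success": True,  "error": None,        "tools_used": 3},
    {"success": True,  "error": None,        "tools_used": 2},
    {"success": False, "error": "retrieval", "tools_used": 1},
    {"success": False, "error": "execution", "tools_used": 1},
]
share, avg_s, avg_f = failure_breakdown(runs)
print(share, avg_s, avg_f)  # 0.5 2.5 1.0
```

On this toy log, half the failures are retrieval errors and successful runs compose more tools on average, mirroring the shape (though not the numbers) of the paper's diagnosis.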

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
