MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Agent, Tool Use, Benchmark, Model Context Protocol
Abstract:

The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this, we propose MCPMark, a benchmark designed to evaluate realistic and comprehensive MCP use, comprising 127 high-quality tasks collaboratively created by human experts and AI agents. Specifically, each task starts from a curated initial state and includes a programmatic script for automatic verification. Moreover, these tasks require richer and more varied interactions with the environment, involving diverse create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.18 execution turns and 17.38 tool calls per task, substantially exceeding those in previous MCP benchmarks and demonstrating the stress-testing nature of MCPMark.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MCPMark, a benchmark comprising 127 tasks designed to evaluate realistic and comprehensive MCP use through diverse CRUD operations and programmatic verification. It resides in the 'Comprehensive Multi-Task MCP Benchmarks' leaf alongside three sibling papers (MCPToolBench, MCP RADAR, MCP Universe), indicating a moderately populated research direction within the broader 50-paper taxonomy. This leaf sits under 'MCP Benchmarking and Evaluation Frameworks,' one of six major branches, suggesting the paper targets a recognized but not overcrowded niche focused on holistic agent evaluation rather than narrow domain-specific or security-focused testing.

The taxonomy reveals neighboring leaves addressing specialized evaluation contexts (tool navigation at scale, stress testing) and general tool-use benchmarks with MCP relevance, while sibling branches cover security analysis, architecture studies, and domain applications. MCPMark's emphasis on realistic workflows and varied interaction depth distinguishes it from specialized evaluation contexts that probe specific dimensions like tool selection or adversarial robustness. The scope note for its leaf explicitly excludes narrow single-domain and security-specific benchmarks, positioning MCPMark as a general-purpose evaluation suite that bridges functional capability assessment and ecological validity without venturing into threat modeling or domain-specific deployments.

Among 24 candidates examined, the benchmark contribution (Contribution A) faced 4 candidates with 0 refutations, suggesting limited direct overlap in the search scope. The human-AI collaborative task creation pipeline (Contribution B) examined 10 candidates and found 2 refutable cases, indicating some prior work on collaborative dataset construction methods. The evaluation framework (Contribution C) examined 10 candidates with 0 refutations, implying the agent-based assessment approach appears relatively distinct within the limited search. These statistics reflect a targeted semantic search rather than exhaustive coverage, so unexamined prior work may exist beyond the top-K matches.

Given the limited search scope of 24 candidates, the analysis captures immediate semantic neighbors but cannot claim comprehensive field coverage. The benchmark's position in a moderately populated leaf with three siblings suggests incremental rather than pioneering novelty, though the specific emphasis on realistic CRUD operations and programmatic verification may offer differentiation. The collaborative task creation pipeline shows measurable overlap with existing methods, while the evaluation framework appears more distinctive within the examined set. A broader literature review would be needed to assess whether similar comprehensive MCP benchmarks exist outside the top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating realistic and comprehensive Model Context Protocol use. The field has rapidly organized around six main branches that reflect both technical and applied concerns. MCP Benchmarking and Evaluation Frameworks focus on creating standardized testbeds and multi-task suites to measure agent performance across diverse scenarios, with works like MCPMark[0] and MCPToolBench[5] providing comprehensive task collections. MCP Security and Safety Analysis addresses vulnerabilities, attack surfaces, and defensive mechanisms, as seen in Red Teaming MCP[1] and MCP Guard[6]. MCP Architecture, Protocols, and Infrastructure examines the underlying design choices, interoperability standards, and scalability challenges, with contributions such as ScaleMCP[2] and MCP Standardization Analysis[18]. MCP-Enabled Multi-Agent and Coordination Systems explores how multiple agents collaborate through the protocol, while Domain-Specific MCP Applications demonstrate real-world deployments in healthcare, manufacturing, IoT, and other sectors. Finally, MCP Conceptual Foundations and Future Directions surveys broader integration pathways and long-term research questions.

A particularly active tension exists between comprehensive benchmarking efforts and security-focused evaluations. Many studies emphasize breadth—covering tool use, reasoning, and multi-step workflows—while others probe adversarial robustness and privacy risks. MCPMark[0] sits squarely within the Comprehensive Multi-Task MCP Benchmarks cluster, aiming to provide a holistic evaluation suite that spans varied task types and realistic interaction patterns. This positions it alongside neighbors like MCP RADAR[22] and MCP Universe[26], which similarly pursue broad coverage but may differ in their emphasis on dynamic environments or evolving datasets. Compared to more narrowly scoped domain benchmarks or security-specific testbeds, MCPMark[0] prioritizes generality and ecological validity, reflecting an ongoing debate about whether unified benchmarks can adequately capture both functional capabilities and safety properties across the rapidly diversifying MCP ecosystem.

Claimed Contributions

MCPMark benchmark for realistic and comprehensive MCP use

The authors introduce MCPMark, a benchmark containing 127 tasks across five MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright). Each task starts from a curated initial state and includes programmatic verification scripts, covering diverse CRUD operations to test realistic multi-step workflows.

Retrieved papers: 4

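As a concrete illustration of what a programmatic verification script of this kind might look like, here is a minimal Python sketch. The state layout and the checked conditions (issue labels, README contents, branch names) are hypothetical stand-ins, not taken from MCPMark itself; the point is only that each task's success is decided by code inspecting the final environment state rather than by an LLM judge.

```python
# Hypothetical verification script for a CRUD-style task: the agent was
# asked to create a triaged issue, update the README, and delete a branch.

def verify_final_state(state: dict) -> bool:
    """Return True only if all expected CRUD effects are present."""
    # Create: a new issue must exist and carry the required label.
    issue = state.get("issues", {}).get("bug-42")
    if issue is None or "triaged" not in issue.get("labels", []):
        return False
    # Update: the README must mention the new release tag.
    if "v2.0" not in state.get("files", {}).get("README.md", ""):
        return False
    # Delete: the stale branch must be gone.
    return "old-feature" not in state.get("branches", [])

# Example final state that satisfies all three checks.
final_state = {
    "issues": {"bug-42": {"labels": ["triaged"]}},
    "files": {"README.md": "Release v2.0 notes"},
    "branches": ["main"],
}
```

Because the verifier is ordinary code run against the post-task state, a pass/fail verdict is fully reproducible across models and runs.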
Human-AI collaborative task creation pipeline

The authors develop a four-step pipeline (Exploration, Evolvement, Verification, Iteration) where domain experts work with AI agents to iteratively create task instructions and programmatic verification scripts, ensuring tasks are realistic, verifiable, and challenging.

Retrieved papers: 10 (can refute)

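The four-step loop described above can be sketched abstractly. The callables below are hypothetical stand-ins for the expert/agent work at each step; this is an illustration of the control flow (evolve and re-verify until the task passes review), not the paper's actual tooling.

```python
# Illustrative sketch of an Exploration -> Evolvement -> Verification ->
# Iteration loop for task creation. All helpers are hypothetical.

def create_task(environment, explore, evolve, write_verifier, passes_review):
    draft = explore(environment)              # Exploration: survey initial state
    while True:
        task = evolve(draft)                  # Evolvement: deepen the task
        verifier = write_verifier(task)       # Verification: programmatic checker
        if passes_review(task, verifier):     # Iteration: loop until accepted
            return task, verifier
        draft = task

# Toy usage: keep evolving until the task has at least three steps.
task, verifier = create_task(
    environment={"pages": 3},
    explore=lambda env: {"steps": 1},
    evolve=lambda draft: {"steps": draft["steps"] + 1},
    write_verifier=lambda task: (lambda state: True),
    passes_review=lambda task, verifier: task["steps"] >= 3,
)
```

The design choice this captures is that the verifier is produced alongside the task and both are revised together, so a task is only accepted once it is simultaneously realistic, challenging, and machine-checkable.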
MCPMark-Agent evaluation framework

The authors provide a lightweight agent framework built on LiteLLM and the MCP Python SDK that supports multiple model providers and MCP servers. It enables consistent evaluation through a tool-calling loop with full state tracking and programmatic verification.

Retrieved papers: 10
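A tool-calling loop of the kind described can be sketched as follows. Here `call_model` and `execute_tool` are hypothetical stand-ins for the LiteLLM completion call and MCP tool dispatch respectively; this is not the actual MCPMark-Agent code, only a minimal sketch of the loop structure under those assumptions.

```python
# Minimal agent loop: call the model, execute any requested tools, feed
# results back, and stop when the model replies without a tool call.

def run_agent(task_prompt, call_model, execute_tool, max_turns=30):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)              # one LLM turn
        messages.append(reply)
        calls = reply.get("tool_calls", [])
        if not calls:                             # no tool call: agent is done
            return messages
        for call in calls:                        # run each requested tool
            result = execute_tool(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return messages

# Example with stub functions: one tool call, then a final answer.
transcript = run_agent(
    "list the files",
    call_model=lambda msgs: (
        {"role": "assistant", "tool_calls": [{"name": "ls", "args": {}}]}
        if len(msgs) == 1
        else {"role": "assistant", "content": "done"}
    ),
    execute_tool=lambda name, args: "file.txt",
)
```

Keeping the full `messages` transcript gives the state tracking mentioned above: every model turn and every tool result is recorded, so per-task turn and tool-call counts fall out of the transcript directly.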

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
