MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Agent, Tool Use, Benchmark, Model Context Protocol
Abstract:

The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this, we propose MCPMark, a benchmark designed to evaluate realistic and comprehensive MCP use, comprising 127 high-quality tasks collaboratively created by human experts and AI agents. Specifically, each task starts from a curated initial state and includes a programmatic script for automatic verification. Moreover, these tasks require richer and more varied interactions with the environment, involving diverse create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.18 execution turns and 17.38 tool calls per task, substantially exceeding those in previous MCP benchmarks and demonstrating the stress-testing nature of MCPMark.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MCPMark, a benchmark comprising 127 tasks designed to evaluate realistic and comprehensive MCP use through diverse CRUD operations and programmatic verification. It resides in the 'Comprehensive Multi-Task MCP Benchmarks' leaf alongside three sibling papers (MCPToolBench, MCP RADAR, MCP Universe), indicating a moderately populated research direction within the broader 50-paper taxonomy. This leaf sits under 'MCP Benchmarking and Evaluation Frameworks,' one of six major branches, suggesting the paper targets a recognized but not overcrowded niche focused on holistic agent evaluation rather than narrow domain-specific or security-focused testing.

The taxonomy reveals neighboring leaves addressing specialized evaluation contexts (tool navigation at scale, stress testing) and general tool-use benchmarks with MCP relevance, while sibling branches cover security analysis, architecture studies, and domain applications. MCPMark's emphasis on realistic workflows and varied interaction depth distinguishes it from specialized evaluation contexts that probe specific dimensions like tool selection or adversarial robustness. The scope note for its leaf explicitly excludes narrow single-domain and security-specific benchmarks, positioning MCPMark as a general-purpose evaluation suite that bridges functional capability assessment and ecological validity without venturing into threat modeling or domain-specific deployments.

Among 24 candidates examined, the benchmark contribution (Contribution A) faced 4 candidates with 0 refutations, suggesting limited direct overlap in the search scope. The human-AI collaborative task creation pipeline (Contribution B) examined 10 candidates and found 2 refutable cases, indicating some prior work on collaborative dataset construction methods. The evaluation framework (Contribution C) examined 10 candidates with 0 refutations, implying the agent-based assessment approach appears relatively distinct within the limited search. These statistics reflect a targeted semantic search rather than exhaustive coverage, so unexamined prior work may exist beyond the top-K matches.

Given the limited search scope of 24 candidates, the analysis captures immediate semantic neighbors but cannot claim comprehensive field coverage. The benchmark's position in a moderately populated leaf with three siblings suggests incremental rather than pioneering novelty, though the specific emphasis on realistic CRUD operations and programmatic verification may offer differentiation. The collaborative task creation pipeline shows measurable overlap with existing methods, while the evaluation framework appears more distinctive within the examined set. A broader literature review would be needed to assess whether similar comprehensive MCP benchmarks exist outside the top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 2

Research Landscape Overview

Core task: evaluating realistic and comprehensive Model Context Protocol use. The field has rapidly organized around six main branches that reflect both technical and applied concerns. MCP Benchmarking and Evaluation Frameworks focus on creating standardized testbeds and multi-task suites to measure agent performance across diverse scenarios, with works like MCPMark[0] and MCPToolBench[5] providing comprehensive task collections. MCP Security and Safety Analysis addresses vulnerabilities, attack surfaces, and defensive mechanisms, as seen in Red Teaming MCP[1] and MCP Guard[6]. MCP Architecture, Protocols, and Infrastructure examines the underlying design choices, interoperability standards, and scalability challenges, with contributions such as ScaleMCP[2] and MCP Standardization Analysis[18]. MCP-Enabled Multi-Agent and Coordination Systems explores how multiple agents collaborate through the protocol, while Domain-Specific MCP Applications demonstrate real-world deployments in healthcare, manufacturing, IoT, and other sectors. Finally, MCP Conceptual Foundations and Future Directions surveys broader integration pathways and long-term research questions.

A particularly active tension exists between comprehensive benchmarking efforts and security-focused evaluations. Many studies emphasize breadth—covering tool use, reasoning, and multi-step workflows—while others probe adversarial robustness and privacy risks. MCPMark[0] sits squarely within the Comprehensive Multi-Task MCP Benchmarks cluster, aiming to provide a holistic evaluation suite that spans varied task types and realistic interaction patterns. This positions it alongside neighbors like MCP RADAR[22] and MCP Universe[26], which similarly pursue broad coverage but may differ in their emphasis on dynamic environments or evolving datasets. Compared to more narrowly scoped domain benchmarks or security-specific testbeds, MCPMark[0] prioritizes generality and ecological validity, reflecting an ongoing debate about whether unified benchmarks can adequately capture both functional capabilities and safety properties across the rapidly diversifying MCP ecosystem.

Claimed Contributions

MCPMark benchmark for realistic and comprehensive MCP use

The authors introduce MCPMark, a benchmark containing 127 tasks across five MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright). Each task starts from a curated initial state and includes programmatic verification scripts, covering diverse CRUD operations to test realistic multi-step workflows.

Retrieved papers: 4

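As a concrete illustration of what a programmatic verification script of this kind might look like, here is a minimal Python sketch. The state layout and the checked conditions (issue labels, README contents, branch names) are hypothetical stand-ins, not taken from MCPMark itself; the point is only that each task's success is decided by code inspecting the final environment state rather than by an LLM judge.

```python
# Hypothetical verification script for a CRUD-style task: the agent was
# asked to create a triaged issue, update the README, and delete a branch.

def verify_final_state(state: dict) -> bool:
    """Return True only if all expected CRUD effects are present."""
    # Create: a new issue must exist and carry the required label.
    issue = state.get("issues", {}).get("bug-42")
    if issue is None or "triaged" not in issue.get("labels", []):
        return False
    # Update: the README must mention the new release tag.
    if "v2.0" not in state.get("files", {}).get("README.md", ""):
        return False
    # Delete: the stale branch must be gone.
    return "old-feature" not in state.get("branches", [])

# Example final state that satisfies all three checks.
final_state = {
    "issues": {"bug-42": {"labels": ["triaged"]}},
    "files": {"README.md": "Release v2.0 notes"},
    "branches": ["main"],
}
```

Because the verifier is ordinary code run against the post-task state, a pass/fail verdict is fully reproducible across models and runs.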
Human-AI collaborative task creation pipeline

The authors develop a four-step pipeline (Exploration, Evolvement, Verification, Iteration) where domain experts work with AI agents to iteratively create task instructions and programmatic verification scripts, ensuring tasks are realistic, verifiable, and challenging.

Retrieved papers: 10 (can refute)

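The four-step loop described above can be sketched abstractly. The callables below are hypothetical stand-ins for the expert/agent work at each step; this is an illustration of the control flow (evolve and re-verify until the task passes review), not the paper's actual tooling.

```python
# Illustrative sketch of an Exploration -> Evolvement -> Verification ->
# Iteration loop for task creation. All helpers are hypothetical.

def create_task(environment, explore, evolve, write_verifier, passes_review):
    draft = explore(environment)              # Exploration: survey initial state
    while True:
        task = evolve(draft)                  # Evolvement: deepen the task
        verifier = write_verifier(task)       # Verification: programmatic checker
        if passes_review(task, verifier):     # Iteration: loop until accepted
            return task, verifier
        draft = task

# Toy usage: keep evolving until the task has at least three steps.
task, verifier = create_task(
    environment={"pages": 3},
    explore=lambda env: {"steps": 1},
    evolve=lambda draft: {"steps": draft["steps"] + 1},
    write_verifier=lambda task: (lambda state: True),
    passes_review=lambda task, verifier: task["steps"] >= 3,
)
```

The design choice this captures is that the verifier is produced alongside the task and both are revised together, so a task is only accepted once it is simultaneously realistic, challenging, and machine-checkable.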
MCPMark-Agent evaluation framework

The authors provide a lightweight agent framework built on LiteLLM and the MCP Python SDK that supports multiple model providers and MCP servers. It enables consistent evaluation through a tool-calling loop with full state tracking and programmatic verification.

Retrieved papers: 10
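A tool-calling loop of the kind described can be sketched as follows. Here `call_model` and `execute_tool` are hypothetical stand-ins for the LiteLLM completion call and MCP tool dispatch respectively; this is not the actual MCPMark-Agent code, only a minimal sketch of the loop structure under those assumptions.

```python
# Minimal agent loop: call the model, execute any requested tools, feed
# results back, and stop when the model replies without a tool call.

def run_agent(task_prompt, call_model, execute_tool, max_turns=30):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)              # one LLM turn
        messages.append(reply)
        calls = reply.get("tool_calls", [])
        if not calls:                             # no tool call: agent is done
            return messages
        for call in calls:                        # run each requested tool
            result = execute_tool(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return messages

# Example with stub functions: one tool call, then a final answer.
transcript = run_agent(
    "list the files",
    call_model=lambda msgs: (
        {"role": "assistant", "tool_calls": [{"name": "ls", "args": {}}]}
        if len(msgs) == 1
        else {"role": "assistant", "content": "done"}
    ),
    execute_tool=lambda name, args: "file.txt",
)
```

Keeping the full `messages` transcript gives the state tracking mentioned above: every model turn and every tool result is recorded, so per-task turn and tool-call counts fall out of the transcript directly.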

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
