MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

The first comprehensive benchmark specifically designed to evaluate LLMs in realistic and challenging tasks through interaction with real-world MCP servers across 6 core domains and 231 tasks.

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces.

To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks.

Through extensive evaluation of leading LLMs, we find that even top-performing models such as GPT-5 (43.72% success rate), Grok-4 (33.33% success rate) and Claude-4.0-Sonnet (29.44% success rate) exhibit significant performance limitations. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
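The three evaluator families named in the abstract (format, static, dynamic) can be pictured with a minimal Python sketch. Everything in it is hypothetical and is not taken from the MCP-Universe codebase: the function names, the JSON answer format, and the `fetch_truth` callback that stands in for a live data lookup are assumptions made only for illustration.

```python
import json
from typing import Callable

def format_evaluator(response: str, required_keys: set[str]) -> bool:
    """Format check: the response must be a JSON object containing the required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)

def static_evaluator(response: str, key: str, expected: str) -> bool:
    """Static check: compare one field against a time-invariant ground truth.
    Assumes the format check has already passed."""
    answer = str(json.loads(response)[key])
    return answer.strip().lower() == expected.strip().lower()

def dynamic_evaluator(response: str, key: str,
                      fetch_truth: Callable[[], float],
                      rel_tol: float = 0.01) -> bool:
    """Dynamic check: fetch the ground truth at evaluation time (hypothetical
    callback) and accept answers within a relative tolerance."""
    truth = fetch_truth()
    answer = float(json.loads(response)[key])
    return abs(answer - truth) <= rel_tol * abs(truth)

# Illustrative values only.
response = '{"answer": "Paris", "price": 100.5}'
print(format_evaluator(response, {"answer", "price"}))
print(static_evaluator(response, "answer", "paris"))
print(dynamic_evaluator(response, "price", lambda: 100.0))
```

In this sketch a task would count as solved only if every evaluator attached to it returns True; that aggregation rule is likewise an assumption.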

MCP-Universe Overview

Figure 1: Example from MCP-Universe illustrating realistic challenges, including real-world tool usage, long-horizon multi-turn tool calls, long context windows, scattered evidence, and large tool spaces. Unlike prior work, MCP-Universe is grounded in real-world MCP servers connected to actual data sources and environments.

Key Statistics

Total Tasks: 231
Core Domains: 6
MCP Servers: 11
Unique Evaluators: 84

Benchmark Domains

Location Navigation

Real-world geospatial navigation tasks involving complex location queries, route planning, and geographic point calculations with actual map data.

45 tasks (19.5% of total)

Web Searching

Advanced web search tasks requiring multi-step information retrieval, synthesis, and real-time data processing from various sources.

55 tasks (23.8% of total)

Browser Automation

Complex browser automation tasks involving real-time web interactions, request submissions, and dynamic content extraction.

39 tasks (16.9% of total)

3D Design

Three-dimensional modeling and design tasks using real Blender software tools with geometric constraints and design specifications.

19 tasks (8.2% of total)

Financial Analysis

Real-time financial data analysis, quantitative investing, market research, and investment calculations involving temporal dynamics and live market data.

40 tasks (17.3% of total)

Repository Management

Version control workflows, code repository management, and collaborative development tasks across different platforms like GitHub.

33 tasks (14.3% of total)

Benchmark Results

Want to add your model/agent to the leaderboard? Contact us to submit new results!

Performance comparison of LLMs and agents on the MCP-Universe benchmark. For each model we report the per-domain and overall Success Rate (SR, in %), the Average Evaluator Score, and the Average Steps per task.

                   Location    Repository  Financial  3D         Browser     Web        Average          Average  Overall
Model              Navigation  Management  Analysis   Designing  Automation  Searching  Evaluator Score  Steps    Success Rate

Proprietary Models
GPT-5              33.33       30.30       67.50      52.63      35.90       45.45      60.23            8.22     43.72
Grok-4             28.89       12.12       40.00      26.32      41.03       41.82      49.01            7.75     33.33
Claude-4.0-Sonnet  22.22       12.12       55.00      26.32      38.46       21.82      50.61            7.46     29.44
o3                 26.67       6.06        40.00      26.32      25.64       29.09      38.95            4.82     26.41
o4-mini            26.67       18.18       40.00      36.84      23.08       18.18      40.38            7.90     25.97
Claude-3.7-Sonnet  13.33       18.18       40.00      36.84      23.08       21.82      40.36            7.16     24.24
Gemini-2.5-Pro     13.33       12.12       50.00      21.05      25.64       12.73      36.93            6.98     22.08
Gemini-2.5-Flash   15.56       12.12       37.50      21.05      30.77       14.55      33.99            8.26     21.65
GPT-4.1            8.89        6.06        40.00      26.32      23.08       10.91      41.32            5.24     18.18
GPT-4o             8.89        9.09        35.00      26.32      12.82       9.09       37.03            6.03     15.58

Open-Source Models
GLM-4.5            17.78       9.09        50.00      26.32      15.38       27.27      41.16            7.33     24.68
Kimi-K2            11.11       9.09        47.50      15.79      15.38       14.55      35.10            6.07     19.05
Qwen3-Coder        8.89        3.03        50.00      26.32      25.64       10.91      37.78            7.78     19.91
Qwen3-235B         11.11       9.09        50.00      15.79      15.38       9.09       38.53            5.74     18.18
DeepSeek-V3        11.11       6.06        30.00      26.32      12.82       7.27       35.82            5.06     14.29
GPT-OSS-120B       6.67        6.06        35.00      10.53      5.13        5.45       26.34            -        11.26

Performance Insights
Even state-of-the-art models show significant limitations: the best model, GPT-5, achieves only a 43.72% overall success rate. This highlights how challenging real-world MCP server interactions are and how much room for improvement remains for current LLM agents. The gap between proprietary and open-source models also remains substantial: the best open-source model, GLM-4.5, reaches 24.68%, about 19 points below GPT-5.
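As a worked check on how the overall figure relates to the per-domain figures, the Python sketch below recomputes GPT-5's overall success rate from its domain success rates and the per-domain task counts. It assumes that the overall success rate is simply the fraction of all 231 tasks solved; the page does not state this explicitly, but the arithmetic is consistent with the reported 43.72%. The function and variable names are illustrative.

```python
# Task counts per domain, as listed under Benchmark Domains.
TASKS = {
    "Location Navigation": 45,
    "Repository Management": 33,
    "Financial Analysis": 40,
    "3D Design": 19,
    "Browser Automation": 39,
    "Web Searching": 55,
}

# GPT-5 per-domain success rates (%), as reported in the results table.
GPT5_SR = {
    "Location Navigation": 33.33,
    "Repository Management": 30.30,
    "Financial Analysis": 67.50,
    "3D Design": 52.63,
    "Browser Automation": 35.90,
    "Web Searching": 45.45,
}

def overall_success_rate(domain_sr: dict[str, float], tasks: dict[str, int]) -> float:
    """Recover solved-task counts per domain, then divide by the total task count.
    Assumption: overall SR = solved tasks / all tasks."""
    solved = sum(round(domain_sr[d] / 100 * n) for d, n in tasks.items())
    return 100 * solved / sum(tasks.values())

print(f"{overall_success_rate(GPT5_SR, TASKS):.2f}")  # 43.72
```

Under this assumption GPT-5 solves 15 + 10 + 27 + 10 + 14 + 25 = 101 of the 231 tasks. Because domains differ in size, the overall rate is a task-weighted average rather than a plain mean of the six domain rates.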

Key Findings

Long-Context Challenge

Token count increases rapidly with interaction steps, often leading to context overflow and degraded performance in multi-step tasks requiring extensive reasoning.
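A small Python sketch can make this growth concrete. It assumes, purely for illustration, that the full history (the prompt plus every tool result) is carried forward on each step. The function name and all token numbers are hypothetical; they are not measurements from the benchmark.

```python
from typing import Iterable, Optional

def first_overflow_step(prompt_tokens: int,
                        tokens_per_step: Iterable[int],
                        context_limit: int) -> Optional[int]:
    """Return the 1-based step at which the accumulated context first
    exceeds the limit, or None if it never does."""
    total = prompt_tokens
    for step, added in enumerate(tokens_per_step, start=1):
        total += added  # each tool call and its result are appended to the history
        if total > context_limit:
            return step
    return None

# Hypothetical numbers: a 2,000-token prompt, 15,000 tokens added per step,
# and a 128,000-token context window.
print(first_overflow_step(2_000, [15_000] * 20, 128_000))  # 9
```

With these made-up numbers the window overflows on the ninth step, which is in the same range as the average step counts reported in the results table.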

Unknown-Tools Challenge

LLM agents often lack familiarity with precise usage patterns, parameter specifications, and expected behaviors of diverse MCP servers.

Cross-Domain Variations

Models show markedly different success rates across application domains, suggesting domain-specific optimization needs and knowledge gaps.

Task Examples

Location Navigation

Hi! My partner and I are planning a special pre-wedding road trip from Los Angeles to San Francisco as one last adventure before we tie the knot! We want to make this journey memorable before we start our married life together. Our plan is to drive through exactly 3 interesting cities between the starting and ending points to really enjoy this time together. Could you please map out exactly 2 distinct driving route options for this pre-wedding celebration? Oh, we must visit friends in Coalinga during our trip to share our exciting news with them!

Repository Management

Hi! I'm learning how to use GitHub and I want to practice exploring repositories and working with issues. Can you help me with a research project? I'd like to search for repositories owned by 'google' that have 'generative-ai' in their name. Once I find them, I want to count how many open issues each repository has that are labeled 'type:bug'. This will help me understand how developers track bugs in real projects! After gathering this information, I need to practice creating my own repository called 'google-generative-ai-issues' and uploading a CSV file named 'google_generative_ai_bug_report.csv' to it.

Web Searching

I'm looking for a city based on the clues below: - The city has a football club that was formed in 1895. - One university in this city has a master's program that teaches Natural Language Processing with 7.5 credits. - One graduated PhD student of this university has published one paper at EACL 2021 and one at EACL 2023 as the first author. - One of the professors in this university is a Fellow of the ACL. You need to find the English name of the city.

Browser Automation

Help me find a one-way flight from Singapore to Beijing, 5 days from now (If now is 2025-07-07, then 5 days later is 2025-07-12). Find the flight on www.booking.com. I want to find the cheapest flight, direct flight, Economy, and I don't want to go to the Daxing airport. I only want to see the price, so as to determine whether I should fly to Beijing or not. Remember to close the browser after you finish the task.
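The departure date in this task is relative to the day the task is run, so both the agent and any evaluator have to compute it. A minimal Python sketch of that date arithmetic follows; the function name is illustrative and the fixed date is the one given in the task text.

```python
from datetime import date, timedelta

def departure_date(today: date, days_ahead: int = 5) -> date:
    """Return the calendar date `days_ahead` days after `today`."""
    return today + timedelta(days=days_ahead)

# Example from the task text: if today is 2025-07-07, the flight is on 2025-07-12.
print(departure_date(date(2025, 7, 7)).isoformat())  # 2025-07-12
```

In a live run, `date.today()` would replace the fixed date.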

Financial Analysis

Hi there! I'm completely new to investing and finance, and honestly, I'm feeling pretty overwhelmed by all the jargon and concepts. I've been trying to learn about something called 'fundamental analysis'. I think it has to do with looking at company finances? Anyway, I heard somewhere that you should look for companies where their net income (I think that's like profit?) has been going up for a few quarters in a row. I'm not really sure what that means exactly, but apparently 2 consecutive quarters of rising net income is a good sign? I'm still figuring out what makes a company worth investing in. Could you help a total beginner like me find 3 company tickers that have this pattern? I'm trying to learn by doing some basic research, even though I barely understand what I'm looking for. Any help would be amazing! I'm just trying to get my feet wet in this whole investing world!
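The screening rule in this task can be sketched as a short Python check. It assumes that "2 consecutive quarters of rising net income" means that the two most recent quarter-over-quarter changes are both positive, which requires three quarterly values; other readings are possible. The function name and the sample figures are made up for illustration and are not real company data.

```python
from typing import Sequence

def has_rising_net_income(quarterly_net_income: Sequence[float], rises: int = 2) -> bool:
    """`quarterly_net_income` is ordered oldest to newest.
    True if the most recent `rises` quarter-over-quarter changes are all positive."""
    recent = quarterly_net_income[-(rises + 1):]
    if len(recent) < rises + 1:
        return False  # not enough quarters to observe `rises` increases
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))

# Made-up net income figures, oldest first.
print(has_rising_net_income([10.0, 12.0, 15.0]))  # True
print(has_rising_net_income([10.0, 15.0, 12.0]))  # False
```

A dynamic evaluator for such a task would have to fetch current quarterly figures for the tickers the agent returns and apply a check of this kind, since the correct answer changes as new earnings are published.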

3D Design

Create a Plane named 'Floor' scaled uniformly by 5. Create a Cylinder named 'Pillar' (default caps) with 16 vertices (sides), a radius of 0.5, and a depth of 4; position it at (X=-2, Y=-2, Z=2). Create a UV Sphere named 'Ball' with 32 segments and 16 rings; position it at (X=2, Y=2, Z=5). Create an Empty (Arrows type) named 'ControlTarget' at (X=0, Y=0, Z=3). Add a 'Track To' constraint to the 'Ball' object, making it track the 'ControlTarget'. Finally, create a Camera object, position it at (X=0, Y=-8, Z=3), and set its rotation so it looks directly at the 'Pillar' object's origin.
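The last requirement, pointing the camera at the 'Pillar' origin, reduces to a direction computation. The Python sketch below uses plain math rather than Blender's API: it computes the unit vector from the camera to the target and checks whether a given forward vector lies within a tolerance of it. The function names and the 1-degree tolerance are assumptions for illustration. Inside Blender the direction would still have to be converted into a rotation, bearing in mind that a Blender camera looks along its local -Z axis.

```python
import math

Vec3 = tuple[float, float, float]

def look_direction(camera: Vec3, target: Vec3) -> Vec3:
    """Unit vector pointing from the camera position to the target position."""
    delta = [t - c for c, t in zip(camera, target)]
    length = math.sqrt(sum(d * d for d in delta))
    if length == 0:
        raise ValueError("camera and target coincide")
    return tuple(d / length for d in delta)

def is_looking_at(forward: Vec3, camera: Vec3, target: Vec3, tol_deg: float = 1.0) -> bool:
    """True if `forward` is within `tol_deg` degrees of the camera-to-target direction."""
    want = look_direction(camera, target)
    have = look_direction((0.0, 0.0, 0.0), forward)  # normalise the forward vector
    cos_angle = max(-1.0, min(1.0, sum(h * w for h, w in zip(have, want))))
    return math.degrees(math.acos(cos_angle)) <= tol_deg

# Positions from the task: camera at (0, -8, 3), Pillar origin at (-2, -2, 2).
camera, pillar = (0.0, -8.0, 3.0), (-2.0, -2.0, 2.0)
direction = look_direction(camera, pillar)
print(tuple(round(x, 3) for x in direction))  # (-0.312, 0.937, -0.156)
```

A camera that simply faces along +Y from its position would miss the pillar by roughly 20 degrees, so `is_looking_at((0.0, 1.0, 0.0), camera, pillar)` returns False.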

Citation

BibTeX
@misc{mcpuniverse,
  title={MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers},
  author={Ziyang Luo and Zhiqi Shen and Wenzhuo Yang and Zirui Zhao and Prathyusha Jwalapuram and Amrita Saha and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2508.14704},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.14704}, 
}