MCP-Universe: Benchmarking Large Language Models with
Real-World Model Context Protocol Servers

MCP-Universe is a comprehensive framework for developing, testing, and benchmarking AI agents. It provides a robust platform for evaluating both LLMs and full agents across a wide range of task environments, supports seamless integration with external MCP servers, and facilitates sophisticated agent orchestration workflows.

MCP-Universe Overview

Figure 1: Example from MCP-Universe illustrating realistic challenges, including real-world tool usage, long-horizon multi-turn tool calls, long context windows, scattered evidence, and large tool spaces. Unlike prior work, MCP-Universe is grounded in real-world MCP servers connected to actual data sources and environments.

Leaderboard

Want to add your model/agent to the leaderboard? Contact us to submit new results!

Performance comparison across different LLMs and agents on the MCP-Universe benchmark. Success Rate (SR), Average Evaluator Score, and Average Steps are reported.

LLM (w/ ReAct) Track: Evaluates large language models using the ReAct (Reasoning and Acting) prompting strategy, in which models alternate between reasoning about the task and taking actions through MCP server function calls. At each step, the LLM may call only one tool; a minimal sketch of this loop is shown below.

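The following sketch illustrates the single-tool-per-step ReAct loop in generic Python. It is not the benchmark's actual harness: call_llm and call_mcp_tool are hypothetical placeholders standing in for the model API and an MCP client connection.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str                                  # the model's reasoning for this step
    action: str | None = None                     # MCP tool name, or None for a final answer
    action_input: dict = field(default_factory=dict)
    observation: str | None = None                # tool output fed back to the model

def call_llm(history: list[Step], task: str) -> Step:
    """Hypothetical placeholder: ask the LLM for its next thought and (optional) tool call."""
    raise NotImplementedError

def call_mcp_tool(name: str, arguments: dict) -> str:
    """Hypothetical placeholder: forward a single function call to the connected MCP server."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 20) -> str | None:
    """Alternate reasoning and acting, with at most one MCP tool call per step."""
    history: list[Step] = []
    for _ in range(max_steps):
        step = call_llm(history, task)            # reasoning phase
        if step.action is None:                   # the model chose to answer directly
            return step.thought
        step.observation = call_mcp_tool(step.action, step.action_input)  # acting phase
        history.append(step)                      # the observation informs the next step
    return None                                   # step budget exhausted without an answer
```

The constraint mirrored here is that each iteration carries at most one MCP function call, so long-horizon tasks translate directly into many loop iterations and a steadily growing message history.
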
Proprietary Models

| Model | Location Navigation | Repository Management | Financial Analysis | 3D Designing | Browser Automation | Web Searching | Average Evaluator Score | Average Steps | Overall Success Rate |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5-High | 26.67 | 30.30 | 67.50 | 57.89 | 43.59 | 45.45 | 62.82 | 6.84 | 44.16 |
| GPT-5-Medium | 33.33 | 30.30 | 67.50 | 52.63 | 35.90 | 45.45 | 60.23 | 8.22 | 43.72 |
| Grok-4 | 28.89 | 12.12 | 40.00 | 26.32 | 41.03 | 41.82 | 49.01 | 7.75 | 33.33 |
| Claude-4.0-Sonnet-Thinking | 17.78 | 12.12 | 55.00 | 42.11 | 38.46 | 23.64 | 51.74 | 6.91 | 30.30 |
| Claude-4.1-Opus-No-Thinking | 17.78 | 21.21 | 52.50 | 36.84 | 35.90 | 20.00 | 49.14 | 7.04 | 29.44 |
| Claude-4.0-Opus-No-Thinking | 15.56 | 15.15 | 55.00 | 31.58 | 38.46 | 18.18 | 46.40 | 7.69 | 28.14 |
| Claude-4.0-Sonnet-No-Thinking | 22.22 | 12.12 | 55.00 | 26.32 | 38.46 | 21.82 | 50.61 | 7.46 | 29.44 |
| Grok-Code-Fast-1 | 26.67 | 9.09 | 62.50 | 15.79 | 20.51 | 18.18 | 44.72 | 6.87 | 26.41 |
| o3-Medium | 26.67 | 6.06 | 40.00 | 26.32 | 25.64 | 29.09 | 38.95 | 4.82 | 26.41 |
| o4-mini-Medium | 26.67 | 18.18 | 40.00 | 36.84 | 23.08 | 18.18 | 40.38 | 7.90 | 25.97 |
| Claude-3.7-Sonnet-No-Thinking | 13.33 | 18.18 | 40.00 | 36.84 | 23.08 | 21.82 | 40.36 | 7.16 | 24.24 |
| Gemini-2.5-Pro | 13.33 | 12.12 | 50.00 | 21.05 | 25.64 | 12.73 | 36.93 | 6.98 | 22.08 |
| Gemini-2.5-Flash | 15.56 | 12.12 | 37.50 | 21.05 | 30.77 | 14.55 | 33.99 | 8.26 | 21.65 |
| GPT-4.1 | 8.89 | 6.06 | 40.00 | 26.32 | 23.08 | 10.91 | 41.32 | 5.24 | 18.18 |
| GPT-4o-2024-08-06 | 8.89 | 9.09 | 35.00 | 26.32 | 12.82 | 9.09 | 37.03 | 6.03 | 15.58 |

Open-Source Models

| Model | Location Navigation | Repository Management | Financial Analysis | 3D Designing | Browser Automation | Web Searching | Average Evaluator Score | Average Steps | Overall Success Rate |
|---|---|---|---|---|---|---|---|---|---|
| GLM-4.5 | 17.78 | 9.09 | 50.00 | 26.32 | 15.38 | 27.27 | 41.16 | 7.33 | 24.68 |
| Qwen3-Coder-480B-A35B-Instruct | 13.33 | 3.03 | 57.50 | 31.58 | 30.77 | 9.09 | 41.39 | 7.77 | 22.94 |
| DeepSeek-V3.1 | 15.56 | 0.00 | 42.50 | 31.58 | 28.21 | 18.18 | 43.23 | 6.31 | 22.08 |
| Kimi-K2-0905 | 11.11 | 3.03 | 52.50 | 15.79 | 25.64 | 10.91 | 41.28 | 6.96 | 19.91 |
| Kimi-K2-0711 | 11.11 | 9.09 | 47.50 | 15.79 | 15.38 | 14.55 | 35.10 | 6.07 | 19.05 |
| Qwen3-Max-Preview (Instruct) | 20.00 | 6.06 | 42.50 | 31.58 | 10.26 | 7.27 | 37.74 | 5.50 | 18.18 |
| Qwen3-235B-A22B-Instruct-2507 | 11.11 | 9.09 | 50.00 | 15.79 | 15.38 | 9.09 | 38.53 | 5.74 | 18.18 |
| DeepSeek-V3 | 11.11 | 6.06 | 30.00 | 26.32 | 12.82 | 7.27 | 35.82 | 5.06 | 14.29 |
| GPT-OSS-120B | 6.67 | 6.06 | 35.00 | 10.53 | 5.13 | 5.45 | 26.34 | - | 11.26 |
Performance Insights

Even state-of-the-art models show significant limitations: the best model (GPT-5) achieves only a 43.72% overall success rate. This highlights the challenging nature of real-world MCP server interactions and the substantial room for improvement in current LLM agents. The gap between proprietary and open-source models also remains substantial.

Key Findings

Long-Context Challenge

Token count increases rapidly with interaction steps, often leading to context overflow and degraded performance in multi-step tasks requiring extensive reasoning.

Unknown-Tools Challenge

LLM agents often lack familiarity with the precise usage patterns, parameter specifications, and expected behaviors of diverse MCP servers, whose tool schemas are only discovered at runtime (see the sketch after these findings).

Cross-Domain Variations

Models show markedly different success rates across application domains, suggesting domain-specific optimization needs and knowledge gaps.
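
The unknown-tools challenge is easiest to see at the protocol level: an agent only learns a server's tool names, descriptions, and parameter schemas at runtime through the MCP tools/list request. Below is a minimal sketch using the official MCP Python SDK; the stdio transport and the GitHub reference server command are illustrative assumptions, not part of the benchmark itself.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools() -> None:
    # Launch an example MCP server over stdio; servers like those used in the
    # benchmark domains (maps, GitHub, browser automation) are configured similarly.
    params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # tools/list is the agent's only source of truth for tool usage:
            # names, descriptions, and JSON-Schema parameter specs are seen here
            # for the first time, so precise usage cannot be memorized in advance.
            result = await session.list_tools()
            for tool in result.tools:
                print(tool.name, tool.description, tool.inputSchema)

if __name__ == "__main__":
    asyncio.run(list_server_tools())
```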

Citation

BibTeX
@misc{mcpuniverse,
  title={MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers},
  author={Ziyang Luo and Zhiqi Shen and Wenzhuo Yang and Zirui Zhao and Prathyusha Jwalapuram and Amrita Saha and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2508.14704},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.14704}, 
}