MCP-Universe: Benchmarking Large Language Models with
Real-World Model Context Protocol Servers
MCP-Universe is a comprehensive framework for developing, testing, and benchmarking AI agents. It provides a robust platform for evaluating both AI agents and LLMs across a wide range of task environments, supports seamless integration with external MCP servers, and facilitates sophisticated agent orchestration workflows.

Figure 1: Example from MCP-Universe illustrating realistic challenges, including real-world tool usage, long-horizon multi-turn tool calls, long context windows, scattered evidence, and large tool spaces. Unlike prior work, MCP-Universe is grounded in real-world MCP servers connected to actual data sources and environments.
Leaderboard
Performance comparison across different LLMs and agents on the MCP-Universe benchmark. Success Rate (SR), Average Evaluator Score, and Average Steps are reported.
LLM (w/ ReAct) Track: Evaluates large language models using the ReAct (Reasoning and Acting) prompting strategy, in which models alternate between reasoning about the task and taking actions through MCP server function calls. At each step, the LLM may call only one tool.
Model | Location Navigation | Repository Management | Financial Analysis | 3D Designing | Browser Automation | Web Searching | Average Evaluator Score | Average Steps | Overall Success Rate
---|---|---|---|---|---|---|---|---|---
GPT-5-High | 26.67 | 30.30 | 67.50 | 57.89 | 43.59 | 45.45 | 62.82 | 6.84 | 44.16 |
GPT-5-Medium | 33.33 | 30.30 | 67.50 | 52.63 | 35.90 | 45.45 | 60.23 | 8.22 | 43.72 |
Grok-4 | 28.89 | 12.12 | 40.00 | 26.32 | 41.03 | 41.82 | 49.01 | 7.75 | 33.33 |
Claude-4.0-Sonnet-Thinking | 17.78 | 12.12 | 55.00 | 42.11 | 38.46 | 23.64 | 51.74 | 6.91 | 30.30 |
Claude-4.1-Opus-No-Thinking | 17.78 | 21.21 | 52.50 | 36.84 | 35.90 | 20.00 | 49.14 | 7.04 | 29.44 |
Claude-4.0-Opus-No-Thinking | 15.56 | 15.15 | 55.00 | 31.58 | 38.46 | 18.18 | 46.40 | 7.69 | 28.14 |
Claude-4.0-Sonnet-No-Thinking | 22.22 | 12.12 | 55.00 | 26.32 | 38.46 | 21.82 | 50.61 | 7.46 | 29.44 |
Grok-Code-Fast-1 | 26.67 | 9.09 | 62.50 | 15.79 | 20.51 | 18.18 | 44.72 | 6.87 | 26.41 |
o3-Medium | 26.67 | 6.06 | 40.00 | 26.32 | 25.64 | 29.09 | 38.95 | 4.82 | 26.41 |
o4-mini-Medium | 26.67 | 18.18 | 40.00 | 36.84 | 23.08 | 18.18 | 40.38 | 7.90 | 25.97 |
Claude-3.7-Sonnet-No-Thinking | 13.33 | 18.18 | 40.00 | 36.84 | 23.08 | 21.82 | 40.36 | 7.16 | 24.24 |
Gemini-2.5-Pro | 13.33 | 12.12 | 50.00 | 21.05 | 25.64 | 12.73 | 36.93 | 6.98 | 22.08 |
Gemini-2.5-Flash | 15.56 | 12.12 | 37.50 | 21.05 | 30.77 | 14.55 | 33.99 | 8.26 | 21.65 |
GPT-4.1 | 8.89 | 6.06 | 40.00 | 26.32 | 23.08 | 10.91 | 41.32 | 5.24 | 18.18 |
GPT-4o-2024-08-06 | 8.89 | 9.09 | 35.00 | 26.32 | 12.82 | 9.09 | 37.03 | 6.03 | 15.58 |
GLM-4.5 | 17.78 | 9.09 | 50.00 | 26.32 | 15.38 | 27.27 | 41.16 | 7.33 | 24.68 |
Qwen3-Coder-480B-A35B-Instruct | 13.33 | 3.03 | 57.50 | 31.58 | 30.77 | 9.09 | 41.39 | 7.77 | 22.94 |
DeepSeek-V3.1 | 15.56 | 0.00 | 42.50 | 31.58 | 28.21 | 18.18 | 43.23 | 6.31 | 22.08 |
Kimi-K2-0905 | 11.11 | 3.03 | 52.50 | 15.79 | 25.64 | 10.91 | 41.28 | 6.96 | 19.91 |
Kimi-K2-0711 | 11.11 | 9.09 | 47.50 | 15.79 | 15.38 | 14.55 | 35.10 | 6.07 | 19.05 |
Qwen3-Max-Preview (Instruct) | 20.00 | 6.06 | 42.50 | 31.58 | 10.26 | 7.27 | 37.74 | 5.50 | 18.18 |
Qwen3-235B-A22B-Instruct-2507 | 11.11 | 9.09 | 50.00 | 15.79 | 15.38 | 9.09 | 38.53 | 5.74 | 18.18 |
DeepSeek-V3 | 11.11 | 6.06 | 30.00 | 26.32 | 12.82 | 7.27 | 35.82 | 5.06 | 14.29 |
GPT-OSS-120B | 6.67 | 6.06 | 35.00 | 10.53 | 5.13 | 5.45 | 26.34 | - | 11.26 |
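The one-tool-per-step ReAct loop used in this track can be sketched roughly as follows. This is a minimal illustration, not MCP-Universe's actual implementation; `call_llm` and `call_mcp_tool` are hypothetical stand-ins for a model backend and an MCP client.

```python
def react_loop(task, call_llm, call_mcp_tool, max_steps=15):
    """Run a ReAct-style agent: reason, then make at most one tool call per step.

    `call_llm` and `call_mcp_tool` are hypothetical placeholders, not part of
    the MCP-Universe API.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history)            # model reasons; may request one tool
        history.append(reply)
        if reply.get("tool_call") is None:   # no tool requested -> final answer
            return reply["content"]
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["arguments"]
        result = call_mcp_tool(name, args)   # execute the call on the MCP server
        history.append({"role": "tool", "name": name, "content": result})
    return None  # step budget exhausted without a final answer
```

Because each iteration appends both the model's reply and the tool result to `history`, the context grows with every step, which is exactly the long-context pressure discussed in the key findings below.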
Key Findings
Long-Context Challenge
Token count increases rapidly with interaction steps, often leading to context overflow and degraded performance in multi-step tasks requiring extensive reasoning.
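One common mitigation is to trim the oldest intermediate turns so the running history stays within a token budget. The sketch below is illustrative only, assuming a toy whitespace word count in place of a real tokenizer (`count_tokens` is a hypothetical stand-in):

```python
def count_tokens(msg):
    # Toy stand-in for a real tokenizer: count whitespace-separated words.
    return len(str(msg.get("content", "")).split())

def trim_history(history, budget):
    """Keep the first (task) message; drop the oldest middle turns until the
    total token count fits the budget."""
    while len(history) > 2 and sum(count_tokens(m) for m in history) > budget:
        history.pop(1)  # evict the oldest non-task message
    return history
```

Trade-off: evicted turns may contain the scattered evidence a later step needs, which is one reason simple truncation degrades performance on long-horizon tasks.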
Unknown-Tools Challenge
LLM agents often lack familiarity with precise usage patterns, parameter specifications, and expected behaviors of diverse MCP servers.
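MCP addresses this partly through runtime discovery: the protocol's JSON-RPC `tools/list` method returns each tool's name, description, and JSON Schema for its parameters, which an agent can surface in its prompt instead of guessing. The response payload below is a fabricated example, not output from a real server:

```python
import json

# JSON-RPC 2.0 request an MCP client sends to enumerate a server's tools.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Fabricated example response illustrating the shape of the result.
response = json.loads("""
{"jsonrpc": "2.0", "id": 1, "result": {"tools": [
  {"name": "maps_geocode",
   "description": "Convert an address into coordinates",
   "inputSchema": {"type": "object",
                   "properties": {"address": {"type": "string"}},
                   "required": ["address"]}}
]}}
""")

# An agent can render each tool's schema into its system prompt.
for tool in response["result"]["tools"]:
    required = tool["inputSchema"].get("required", [])
    print(f'{tool["name"]}: {tool["description"]} (required: {required})')
```

Even with discovery, schemas alone do not convey behavioral quirks (rate limits, output formats, error semantics), so unfamiliar servers remain a challenge.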
Cross-Domain Variations
Models show markedly different success rates across application domains, suggesting domain-specific optimization needs and knowledge gaps.