MCP-Universe Documentation

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces.

To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic and challenging tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators that check compliance with the required output format, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks.

Through extensive evaluation of leading LLMs, we find that even top-performing models such as GPT-5 (43.72% success rate), Grok-4 (33.33% success rate) and Claude-4.1-Opus (29.44% success rate) exhibit significant performance limitations. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.

Key Statistics

  • 231 Total Tasks
  • 6 Core Domains
  • 11 MCP Servers
  • 84 Unique Evaluators

Benchmark Domains

Location Navigation

Real-world geospatial navigation tasks involving complex location queries, route planning, and geographic point calculations with actual map data.

45 tasks (19.5% of total)

Web Searching

Advanced web search tasks requiring multi-step information retrieval, synthesis, and real-time data processing from various sources.

55 tasks (23.8% of total)

Browser Automation

Complex browser automation tasks involving real-time web interactions, request submissions, and dynamic content extraction.

39 tasks (16.9% of total)

3D Design

Three-dimensional modeling and design tasks using real Blender software tools with geometric constraints and design specifications.

19 tasks (8.2% of total)

Financial Analysis

Real-time financial data analysis, quantitative investing, market research, and investment calculations involving temporal dynamics and live market data.

40 tasks (17.3% of total)

Repository Management

Version control workflows, code repository management, and collaborative development tasks across different platforms like GitHub.

33 tasks (14.3% of total)

Architecture Overview

The MCP-Universe architecture consists of the following key components:

  • Agents (mcpuniverse/agent/): Base implementations for different agent types
  • Workflows (mcpuniverse/workflows/): Orchestration and coordination layer
  • MCP Servers (mcpuniverse/mcp/): Protocol management and external service integration
  • LLM Integration (mcpuniverse/llm/): Multi-provider language model support
  • Benchmarking (mcpuniverse/benchmark/): Evaluation and testing framework
  • Dashboard (mcpuniverse/dashboard/): Visualization and monitoring interface

The diagram below illustrates the high-level view:

┌─────────────────────────────────────────────────────────────────┐
│                      Application Layer                          │
├─────────────────────────────────────────────────────────────────┤
│  Dashboard  │    Web API      │   Python Lib   │   Benchmarks   │
│   (Gradio)  │   (FastAPI)     │                │                │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                      Orchestration Layer                        │
├─────────────────────────────────────────────────────────────────┤
│           Workflows           │        Benchmark Runner         │
│    (Chain, Router, etc.)      │      (Evaluation Engine)        │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                        Agent Layer                              │
├─────────────────────────────────────────────────────────────────┤
│  BasicAgent │   ReActAgent    │  FunctionCall  │     Other      │
│             │                 │     Agent      │     Agents     │
└─────────────┬─────────────────┬────────────────┬────────────────┘
              │                 │                │
┌─────────────▼─────────────────▼────────────────▼────────────────┐
│                      Foundation Layer                           │
├─────────────────────────────────────────────────────────────────┤
│   MCP Manager   │   LLM Manager   │  Memory Systems │  Tracers  │
│   (Servers &    │   (Multi-Model  │   (RAM, Redis)  │ (Logging) │
│    Clients)     │    Support)     │                 │           │
└─────────────────┴─────────────────┴─────────────────┴───────────┘

More information can be found in the documentation.

Prerequisites

Before getting started with MCP-Universe, ensure you have the following prerequisites installed on your system:

  • Python: Requires version 3.10 or higher
  • Docker: Used for running Dockerized MCP servers
  • PostgreSQL (optional): Used for database storage and persistence
  • Redis (optional): Used for caching and memory management

Installation

Follow these steps to install MCP-Universe on your system:

1. Clone the repository

git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
cd MCP-Universe

2. Create and activate virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt
pip install -r dev-requirements.txt

4. Platform-specific requirements

Linux:

sudo apt-get install libpq-dev

macOS:

brew install postgresql

5. Configure pre-commit hooks

pre-commit install

6. Environment configuration

cp .env.example .env
# Edit .env with your API keys and configuration

Quick Test

To run benchmarks, you first need to set environment variables:

  1. Copy the .env.example file to a new file named .env
  2. In the .env file, set the required API keys for various services used by the agents, such as OPENAI_API_KEY and GOOGLE_MAPS_API_KEY

To execute a benchmark programmatically:

from mcpuniverse.tracer.collectors import MemoryCollector  # You can also use SQLiteCollector
from mcpuniverse.benchmark.runner import BenchmarkRunner

async def test():
    trace_collector = MemoryCollector()
    # Choose a benchmark config file under the folder "mcpuniverse/benchmark/configs"
    benchmark = BenchmarkRunner("dummy/benchmark_1.yaml")
    # Run the specified benchmark
    results = await benchmark.run(trace_collector=trace_collector)
    # Get traces
    trace_id = results[0].task_trace_ids["dummy/tasks/weather.json"]
    trace_records = trace_collector.get(trace_id)
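
Since test() above is a coroutine, you can run it from a regular script with asyncio, for example:

import asyncio

# Execute the async entry point defined above
asyncio.run(test())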

Evaluating LLMs and Agents

This section provides comprehensive instructions for evaluating LLMs and AI agents using the MCP-Universe benchmark suite. The framework supports evaluation across multiple domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.

Prerequisites

Before running benchmark evaluations, ensure you have completed the Getting Started section and have the following:

  • Python: Version 3.10 or higher
  • Docker: Installed and available in your environment
  • All required dependencies installed via pip install -r requirements.txt
  • Active virtual environment
  • Appropriate API access for the services you intend to evaluate

Environment Configuration

Initial Setup

Copy the environment template and configure your API credentials:

cp .env.example .env

API Keys and Configuration

Configure the following environment variables in your .env file. The required keys depend on which benchmark domains you plan to evaluate:

Core LLM Providers

  • OPENAI_API_KEY (OpenAI): API key for GPT models (gpt-5, etc.). Required for all domains.
  • ANTHROPIC_API_KEY (Anthropic): API key for Claude models. Required for all domains.
  • GEMINI_API_KEY (Google): API key for Gemini models. Required for all domains.

Domain-Specific Services

  • SERP_API_KEY (SerpAPI): Web search API for search benchmark evaluation.
  • GOOGLE_MAPS_API_KEY (Google Maps): Geolocation and mapping services.
  • GITHUB_PERSONAL_ACCESS_TOKEN (GitHub): Personal access token for repository operations.
  • GITHUB_PERSONAL_ACCOUNT_NAME (GitHub): Your GitHub username.
  • NOTION_API_KEY (Notion): Integration token for Notion workspace access.
  • NOTION_ROOT_PAGE (Notion): Root page ID for your Notion workspace (see configuration example below).

System Paths

  • BLENDER_APP_PATH: Full path to the Blender executable (we used v4.4.0), e.g. /Applications/Blender.app/Contents/MacOS/Blender
  • MCPUniverse_DIR: Absolute path to your MCP-Universe repository, e.g. /Users/username/MCP-Universe
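
For example, these paths might appear in your .env file as follows (values are illustrative; use the locations on your machine):

BLENDER_APP_PATH=/Applications/Blender.app/Contents/MacOS/Blender
MCPUniverse_DIR=/Users/username/MCP-Universe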

Configuration Examples

Notion Root Page ID:
If your Notion page URL is:

https://www.notion.so/your_workspace/MCP-Evaluation-1dd6d96e12345678901234567eaf9eff

Set NOTION_ROOT_PAGE=MCP-Evaluation-1dd6d96e12345678901234567eaf9eff

Blender Installation:

  1. Download Blender v4.4.0 from blender.org
  2. Install our modified Blender MCP server following the installation guide
  3. Set the path to the Blender executable

⚠️ Security Recommendations

🔒 IMPORTANT SECURITY NOTICE

Please read and follow these security guidelines carefully before running benchmarks:

  • 🚨 GitHub Integration: CRITICAL - We strongly recommend using a dedicated test GitHub account for benchmark evaluation. The AI agent will perform real operations on GitHub repositories, which could potentially modify or damage your personal repositories.
  • 🔐 API Key Management: Store API keys securely and never commit them to version control. Use environment variables or secure key management systems. Regularly rotate your API keys for enhanced security.
  • 🛡️ Access Permissions: Grant minimal necessary permissions for each service integration. Review and limit API key scopes to only required operations. Monitor API usage and set appropriate rate limits.
  • ⚡ Blender Operations: The 3D design benchmarks will execute Blender commands that may modify or create files on your system. Ensure you have adequate backups and run in an isolated environment if necessary.

Benchmark Configuration

Each benchmark domain has a dedicated YAML configuration file located in mcpuniverse/benchmark/configs/test/. To evaluate your LLM/agent, modify the appropriate configuration file:

  • Web Search (web_search.yaml): Search engine and information retrieval tasks
  • Location Navigation (location_navigation.yaml): Geographic and mapping-related queries
  • Browser Automation (browser_automation.yaml): Web interaction and automation scenarios
  • Financial Analysis (financial_analysis.yaml): Market data analysis and financial computations
  • Repository Management (repository_management.yaml): Git operations and code repository tasks
  • 3D Design (3d_design.yaml): Blender-based 3D modeling and design tasks
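
For example, to evaluate a different LLM you would typically edit the kind: llm block of the chosen configuration file, using the same YAML format shown later in this documentation (the model name below is illustrative):

kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o  # swap in the model you want to evaluate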

Execution

Once you have configured your environment and benchmark settings, you can execute evaluations using the following methods:

Running Individual Benchmarks

Execute specific domain benchmarks using the following commands:

# Set Python path and run individual benchmarks
export PYTHONPATH=.

# Location Navigation
python tests/benchmark/test_benchmark_location_navigation.py

# Browser Automation  
python tests/benchmark/test_benchmark_browser_automation.py

# Financial Analysis
python tests/benchmark/test_benchmark_financial_analysis.py

# Repository Management
python tests/benchmark/test_benchmark_repository_management.py

# Web Search
python tests/benchmark/test_benchmark_web_search.py

# 3D Design
python tests/benchmark/test_benchmark_3d_design.py

Batch Execution

For comprehensive evaluation across all domains:

#!/bin/bash
export PYTHONPATH=.

domains=("location_navigation" "browser_automation" "financial_analysis" 
         "repository_management" "web_search" "3d_design")

for domain in "${domains[@]}"; do
    echo "Running benchmark: $domain"
    python "tests/benchmark/test_benchmark_${domain}.py"
    echo "Completed: $domain"
done

Programmatic Execution

from mcpuniverse.benchmark.runner import BenchmarkRunner
from mcpuniverse.tracer.collectors import MemoryCollector

async def run_benchmark():
    trace_collector = MemoryCollector()
    # Choose a benchmark config file under the folder "mcpuniverse/benchmark/configs"
    benchmark = BenchmarkRunner("dummy/benchmark_1.yaml")
    # Run the specified benchmark
    results = await benchmark.run(trace_collector=trace_collector)
    # Get traces
    trace_id = results[0].task_trace_ids["dummy/tasks/weather.json"]
    trace_records = trace_collector.get(trace_id)
    return results

Save Running Logs

If you want to save the run logs to a file, pass a FileCollector as the trace_collector to the benchmark run function:

from mcpuniverse.tracer.collectors import FileCollector

trace_collector = FileCollector(log_file="log/location_navigation.log")
benchmark_results = await benchmark.run(trace_collector=trace_collector)

Save Benchmark Reports

To save a report of the benchmark results, use BenchmarkReport:

from mcpuniverse.benchmark.report import BenchmarkReport

report = BenchmarkReport(benchmark, trace_collector=trace_collector)
report.dump()

Visualize Results

To run the benchmark with intermediate results and see real-time progress, pass callbacks=get_vprint_callbacks() to the run function:

from mcpuniverse.callbacks.handlers.vprint import get_vprint_callbacks

benchmark_results = await benchmark.run(
    trace_collector=trace_collector, 
    callbacks=get_vprint_callbacks()
)

This will print out the intermediate results as the benchmark runs.

Generate Reports

python -m mcpuniverse.benchmark.report --results benchmark_results.db --output report.html

Launch Dashboard

python -m mcpuniverse.dashboard.app

LLM Configuration

Configure different LLM providers for your agents:

OpenAI Configuration

kind: llm
spec:
  name: gpt-5-medium
  type: openai
  config:
    model_name: gpt-5-medium
    temperature: 0.1
    max_tokens: 4096

Anthropic Configuration

kind: llm
spec:
  name: claude-4-sonnet
  type: anthropic
  config:
    model_name: claude-4.0-sonnet
    temperature: 0.1
    max_tokens: 4096

Google Configuration

kind: llm
spec:
  name: gemini-pro
  type: google
  config:
    model_name: gemini-pro
    temperature: 0.1
    max_tokens: 4096

Agent Configuration

Configure different agent types for your evaluations:

ReAct Agent

kind: agent
spec:
  name: react-agent
  type: react
  config:
    llm: gpt-5-medium
    max_iterations: 10
    reflection_enabled: true

Function Call Agent

kind: agent
spec:
  name: function-call-agent
  type: function_call
  config:
    llm: claude-4-sonnet
    parallel_calls: true

Basic Agent

kind: agent
spec:
  name: basic-agent
  type: basic
  config:
    llm: gemini-pro
    instruction: "You are a helpful assistant."

Workflow Configuration

Configure workflows to orchestrate multiple agents:

Sequential Workflow

kind: workflow
spec:
  name: sequential-workflow
  type: chain
  config:
    agents:
      - react-agent
      - function-call-agent
    coordination: sequential

Orchestrator Workflow

kind: workflow
spec:
  name: orchestrator-workflow
  type: orchestrator
  config:
    llm: gpt-5-medium
    agents:
      - basic-agent
      - function-call-agent

Creating Custom Benchmarks

A benchmark is defined by three main configuration elements: the task definition, agent/workflow definition, and the benchmark configuration itself. Below is an example using a simple "weather forecasting" task.

Task Definition

The task definition is provided in JSON format, for example:

{
  "category": "general",
  "question": "What's the weather in San Francisco now?",
  "mcp_servers": [
    {
      "name": "weather"
    }
  ],
  "output_format": {
    "city": "<City>",
    "weather": "<Weather forecast results>"
  },
  "evaluators": [
    {
      "func": "json -> get(city)",
      "op": "=",
      "value": "San Francisco"
    }
  ]
}

Field Descriptions:

  1. category: The task category, e.g., "general", "google-maps", etc. You can set any value for this property.
  2. question: The main question you want to ask in this task. This is treated as a user message.
  3. mcp_servers: A list of MCP servers that are supported in this framework.
  4. output_format: The desired output format of agent responses.
  5. evaluators: A list of tests to run. Each test/evaluator has three attributes: "func" indicates how to extract values from the agent response, "op" is the comparison operator, and "value" is the ground-truth value.

Predefined Functions

In "evaluators", you need to write a rule ("func" attribute) showing how to extract values for testing. There are several predefined functions in this repo:

  1. json: Perform JSON decoding.
  2. get: Get the value of a key.
  3. len: Get the length of a list.
  4. foreach: Do a FOR-EACH loop.

For example, let's define:

data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}

Then get(x) -> foreach -> get(y) -> len will do the following:

  1. Get the value of "x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}].
  2. Do a foreach loop and get the value of "y": [[1], [1, 1], [1, 2, 3, 4]].
  3. Get the length of each list: [1, 2, 4].
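
As a rough illustration of these semantics in plain Python (not the framework's actual evaluator API):

# Equivalent of get(x) -> foreach -> get(y) -> len
data = {"x": [{"y": [1]}, {"y": [1, 1]}, {"y": [1, 2, 3, 4]}]}

x = data["x"]                    # get(x)
ys = [item["y"] for item in x]   # foreach -> get(y)
lengths = [len(y) for y in ys]   # len applied to each element
print(lengths)                   # [1, 2, 4]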

If these predefined functions are not enough, you can implement custom ones. For more details, please check this documentation.

Benchmark Definition

Define agent(s) and benchmark in a YAML file. Here's a simple weather forecast benchmark:

kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: ReAct-agent
  type: react
  config:
    llm: llm-1
    instruction: You are an agent for weather forecasting.
    servers:
      - name: weather

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: ReAct-agent
  tasks:
    - dummy/tasks/weather.json

The benchmark definition mainly contains two parts: the agent definition and the benchmark configuration. The benchmark configuration is simple—you just need to specify the agent to use (by the defined agent name) and a list of tasks to evaluate.

Configuration Steps:

  1. Specify LLMs: Begin by declaring the large language models (LLMs) you want the agents to use. Each LLM component must be assigned a unique name (e.g., "llm-1").
  2. Define an agent: Next, define an agent by providing its name and selecting an agent class. Agent classes are available in the mcpuniverse.agent package. Commonly used classes include "basic", "function-call", and "react".
  3. Create complex workflows: Beyond simple agents, the framework supports the definition of sophisticated, orchestrated workflows where multiple agents interact or collaborate to solve more complex tasks.

Complex Workflows

For example, here's a complex workflow with multiple agents:

kind: llm
spec:
  name: llm-1
  type: openai
  config:
    model_name: gpt-4o

---
kind: agent
spec:
  name: basic-agent
  type: basic
  config:
    llm: llm-1
    instruction: Return the latitude and the longitude of a place.

---
kind: agent
spec:
  name: function-call-agent
  type: function-call
  config:
    llm: llm-1
    instruction: You are an agent for weather forecast. Please return the weather today at the given latitude and longitude.
    servers:
      - name: weather

---
kind: workflow
spec:
  name: orchestrator-workflow
  type: orchestrator
  config:
    llm: llm-1
    agents:
      - basic-agent
      - function-call-agent

---
kind: benchmark
spec:
  description: Test the agent for weather forecasting
  agent: orchestrator-workflow
  tasks:
    - dummy/tasks/weather.json

For further details, refer to the in-code documentation or existing configuration samples in the repository.

System Architecture

This section provides a comprehensive overview of the MCP-Universe system architecture, including its core components, design patterns, and how the different layers interact.

Core Components

1. Agent Layer (mcpuniverse/agent/)

The agent layer is the core of MCP-Universe, providing different types of AI agents that can reason, act, and interact with external tools.

BaseAgent
  • Purpose: Abstract base class that all agents inherit from
  • Key Features:
    • MCP server connection management
    • Tool execution capabilities
    • Configuration management
    • Tracing and debugging support
    • Lifecycle management (initialize, execute, cleanup)
Agent Types
  • BasicAgent: Simple LLM interaction agent for straightforward tasks
  • ReActAgent: Implements reasoning and acting pattern with iterative thinking
  • FunctionCallAgent: Uses native LLM tool calling APIs
  • ReflectionAgent: Self-reflective agent with memory capabilities
  • ClaudeCodeAgent: Specialized agent for code-related tasks

2. MCP Layer (mcpuniverse/mcp/)

Manages Model Context Protocol servers and clients, enabling agents to interact with external tools and services.

MCPManager
  • Purpose: Central management of MCP server configurations and client connections
  • Key Features:
    • Server configuration loading from JSON
    • Dynamic server registration and management
    • Client building for stdio and SSE transports
    • Environment variable templating
    • Parameter validation
Built-in MCP Servers
  • Weather: National Weather Service API integration
  • Google Search: Search functionality via SERP API
  • Google Maps: Location and navigation services
  • Google Sheets: Spreadsheet operations
  • Wikipedia: Knowledge base access
  • Blender: 3D modeling operations
  • Yahoo Finance: Financial data access
  • GitHub: Repository management
  • Playwright: Web browser automation

3. LLM Layer (mcpuniverse/llm/)

Provides unified interface to multiple language model providers.

Supported Providers
  • OpenAI: GPT models (GPT-4o, GPT-4o-mini, etc.)
  • Anthropic: Claude models (Claude-3.5-Sonnet, Claude-3-Opus, etc.)
  • Google: Gemini models (Gemini-2.0-Flash, Gemini-1.5-Pro, etc.)
  • Mistral: Mistral AI models
  • Ollama: Local model serving
  • Grok: xAI's Grok models
  • DeepSeek: DeepSeek models

4. Workflow Layer (mcpuniverse/workflows/)

Orchestrates complex multi-agent interactions and task execution patterns.

Workflow Types
  • Chain: Sequential agent execution
  • Router: Conditional agent selection based on input
  • Parallelization: Concurrent agent execution
  • Orchestrator: Complex multi-agent coordination
  • EvaluatorOptimizer: Agent performance optimization

5. Benchmark Layer (mcpuniverse/benchmark/)

Comprehensive system for evaluating agent performance across different tasks and domains.

Supported Benchmark Domains
  • Location Navigation: Google Maps integration for navigation tasks
  • Web Search: Information retrieval and search tasks
  • Browser Automation: Web interaction and automation scenarios
  • Financial Analysis: Market data analysis and financial computations
  • Repository Management: Git operations and code repository tasks
  • 3D Design: Blender-based 3D modeling and design tasks

Data Flow

Agent Execution Flow

User Input → Agent.execute() → LLM.generate() → Tool Execution → Response
     ↓              ↓              ↓              ↓              ↓
  Callbacks ←   Tracer     ←   Callbacks   ←   MCP Client  ←  Evaluation

MCP Tool Execution Flow

Agent → MCPManager → MCPClient → MCP Server → External Service
  ↑         ↑           ↑           ↑             ↑
Config    Server     Transport   Tool Call    API/Service
Loading   Selection   Protocol    Execution    Response

Benchmark Execution Flow

YAML Config → BenchmarkRunner → Task Execution → Evaluation → Results
     ↓              ↓               ↓              ↓          ↓
  Task Def    Agent Creation   Agent Execution  Scoring   Report Gen

Configuration Management

Hierarchical Configuration

  • Global: Framework-level settings
  • Agent: Agent-specific configurations
  • Server: MCP server configurations
  • Task: Benchmark task definitions

Environment Variable Templating

{
  "env": {
    "API_KEY": "{{MY_API_KEY}}",
    "PORT": "{{PORT}}"
  }
}
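
Conceptually, each {{VARIABLE_NAME}} placeholder is resolved from the environment at runtime. A minimal sketch of this substitution in plain Python (illustrative only, not the framework's actual implementation):

import os
import re

def render_template(text: str) -> str:
    """Replace {{VAR}} placeholders with values from the environment."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: os.environ.get(m.group(1), ""), text)

print(render_template('{"API_KEY": "{{MY_API_KEY}}", "PORT": "{{PORT}}"}'))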

This modular architecture provides a solid foundation for AI agent development while maintaining flexibility, scalability, and extensibility.

Adding MCP Servers

This guide explains how to add new Model Context Protocol (MCP) servers to the MCP-Universe framework. There are three main approaches: creating custom Python MCP servers, integrating existing third-party servers, and connecting to remote MCP servers.

Overview

MCP-Universe uses a centralized server configuration system that manages different types of MCP servers. All server configurations are stored in mcpuniverse/mcp/configs/server_list.json, which defines how to launch, connect to, and manage each server.

1. Adding Custom Python MCP Servers

Step 1: Create Your Server Implementation

Create a new directory in mcpuniverse/mcp/servers/ for your server:

mkdir mcpuniverse/mcp/servers/my_custom_server

Create the server implementation file:

"""
A custom MCP server implementation
"""
import click
from typing import Any
from mcp.server.fastmcp import FastMCP
from mcpuniverse.common.logger import get_logger

def build_server(port: int) -> FastMCP:
    """Initialize the MCP server."""
    mcp = FastMCP("my-custom-server", port=port)
    logger = get_logger("my-custom-server")
    
    @mcp.tool()
    async def my_custom_tool(param1: str, param2: int = 10) -> str:
        """
        Description of what this tool does.
        
        Args:
            param1: Description of parameter 1
            param2: Description of parameter 2 (optional)
            
        Returns:
            Result description
        """
        logger.info(f"Executing custom tool with {param1} and {param2}")
        result = f"Processed {param1} with value {param2}"
        return result
    
    return mcp
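
The server_list.json entry in the next step launches this module via python3 -m and, for SSE, passes --transport and --port. Below is a minimal sketch of the matching CLI entry point, assuming it lives in the same module as build_server (or in the package's __main__.py); this is an assumed pattern built on FastMCP's run() method, so adapt it to how the built-in servers under mcpuniverse/mcp/servers/ structure theirs:

import click  # already imported in the server module above

@click.command()
@click.option("--transport", type=click.Choice(["stdio", "sse"]), default="stdio",
              help="Transport used to talk to the MCP client")
@click.option("--port", type=int, default=8000, help="Port for the SSE transport")
def main(transport: str, port: int) -> None:
    """Build the server and run it with the requested transport."""
    mcp = build_server(port=port)
    mcp.run(transport=transport)

if __name__ == "__main__":
    main()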

Step 2: Register Your Server

Add your server configuration to mcpuniverse/mcp/configs/server_list.json:

{
  "my-custom-server": {
    "stdio": {
      "command": "python3",
      "args": [
        "-m", "mcpuniverse.mcp.servers.my_custom_server"
      ]
    },
    "sse": {
      "command": "python3", 
      "args": [
        "-m", "mcpuniverse.mcp.servers.my_custom_server",
        "--transport", "sse",
        "--port", "{{PORT}}"
      ]
    },
    "env": {
      "CUSTOM_API_KEY": "{{CUSTOM_API_KEY}}",
      "CUSTOM_CONFIG": "{{CUSTOM_CONFIG_PATH}}"
    }
  }
}

Step 3: Usage in Agents

Use your server in agent configurations:

kind: agent
spec:
  name: test-agent
  type: function-call
  config:
    llm: gpt-4o-llm
    instruction: An agent that uses my custom server
    servers:
      - name: my-custom-server
      - name: weather  # Can combine with other servers

2. Adding Third-Party MCP Servers

NPM/Node.js Packages

For servers published as NPM packages:

{
  "github": {
    "stdio": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-github"
      ]
    },
    "env": {
      "GITHUB_PERSONAL_ACCESS_TOKEN": "{{GITHUB_PERSONAL_ACCESS_TOKEN}}"
    }
  },
  
  "filesystem": {
    "stdio": {
      "command": "npx",
      "args": [
        "-y", 
        "@modelcontextprotocol/server-filesystem",
        "{{FILESYSTEM_DIRECTORY}}"
      ]
    }
  }
}

Python Packages

For Python packages available via pip:

{
  "calculator": {
    "stdio": {
      "command": "python3",
      "args": [
        "-m", "mcp_server_calculator"
      ]
    }
  },
  
  "fetch": {
    "stdio": {
      "command": "python3", 
      "args": [
        "-m", "mcp_server_fetch",
        "--ignore-robots-txt"
      ]
    }
  }
}

3. Environment Variables and Configuration

Create a .env file in your project root:

# Third-party API keys
GITHUB_PERSONAL_ACCESS_TOKEN=your_github_token_here
GOOGLE_MAPS_API_KEY=your_google_maps_key
SERP_API_KEY=your_serp_api_key

# Custom server configurations
CUSTOM_API_KEY=your_custom_api_key
FILESYSTEM_DIRECTORY=/path/to/allowed/directory

# Remote server authentication
REMOTE_API_TOKEN=your_remote_token

Template Variables

The server configuration supports template variables that are replaced at runtime:

  • {{PORT}}: Automatically assigned port for SSE transport
  • Any environment variable in {{VARIABLE_NAME}} format

Best Practices

  1. Documentation: Document your tools with clear descriptions and parameter types
  2. Error Handling: Implement proper error handling in your server tools
  3. Testing: Write comprehensive tests for your server functionality
  4. Security: Never hardcode API keys; always use environment variables
  5. Performance: Consider async operations for I/O bound tasks
  6. Logging: Use structured logging for debugging and monitoring

Custom Agent Implementation

This guide explains how to implement custom agents in the MCP-Universe framework, building upon the existing agent architecture to create specialized AI agents.

Architecture Overview

MCP-Universe provides a flexible agent framework with the following core components:

  • BaseAgent: Abstract base class that all agents inherit from
  • BaseAgentConfig: Configuration class for agent parameters
  • AgentResponse: Standardized response format
  • MCPManager: Manages connections to MCP servers
  • Tracer: Handles execution tracing and debugging

Step 1: Define Your Agent Configuration

Create a configuration class that extends BaseAgentConfig:

from dataclasses import dataclass
from mcpuniverse.agent.base import BaseAgentConfig

@dataclass
class MyCustomAgentConfig(BaseAgentConfig):
    """Configuration for your custom agent."""
    
    # Add custom configuration parameters
    max_retries: int = 3
    temperature: float = 0.7
    enable_memory: bool = True
    custom_prompt_path: str = "custom_prompt.j2"
    
    # You can override default values
    system_prompt: str = "path/to/your/custom_system_prompt.j2"
    max_iterations: int = 10

Step 2: Implement Your Custom Agent Class

Create your agent class by inheriting from BaseAgent:

from typing import Optional, Union, Dict, List
from mcpuniverse.agent.base import BaseAgent
from mcpuniverse.agent.types import AgentResponse
from mcpuniverse.mcp.manager import MCPManager
from mcpuniverse.llm.base import BaseLLM
from mcpuniverse.tracer import Tracer

class MyCustomAgent(BaseAgent):
    """A custom agent implementation."""
    
    # Required class attributes
    config_class = MyCustomAgentConfig
    alias = ["custom", "my-agent"]  # Alternative names
    
    def __init__(
        self,
        mcp_manager: Optional[MCPManager] = None,
        llm: Optional[BaseLLM] = None,
        config: Optional[Union[Dict, str]] = None,
        **kwargs
    ):
        """Initialize your custom agent."""
        super().__init__(mcp_manager=mcp_manager, llm=llm, config=config)
        
        # Initialize any custom attributes
        self._custom_memory = []
        self._retry_count = 0
    
    async def _execute(
        self,
        message: Union[str, List[str]],
        **kwargs
    ) -> AgentResponse:
        """Main execution method - implement your agent logic here."""
        
        # Get tracer for debugging
        tracer = kwargs.get("tracer", Tracer())
        callbacks = kwargs.get("callbacks", [])
        
        # Implement your custom reasoning logic
        response = await self._custom_reasoning_loop(
            message, tracer, callbacks
        )
        
        return AgentResponse(
            name=self._name,
            class_name=self.__class__.__name__,
            response=response,
            trace_id=tracer.trace_id
        )
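
The _custom_reasoning_loop helper called above is not provided by the framework. A minimal sketch of such a method is shown below, assuming the wrapped LLM exposes an async generation call and that the parsed config is reachable from the agent; the attribute names used here are hypothetical, so adapt them to the actual BaseAgent and BaseLLM interfaces:

    async def _custom_reasoning_loop(self, message, tracer, callbacks) -> str:
        """Hypothetical reasoning loop: retry the LLM call up to max_retries times."""
        last_error = None
        for _ in range(self._config.max_retries):  # assumed config attribute
            try:
                # Assumed LLM call; replace with the actual BaseLLM API
                return await self._llm.generate(str(message))
            except Exception as error:
                last_error = error
                self._retry_count += 1
        raise RuntimeError(f"Reasoning loop failed after retries: {last_error}")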

Step 3: Create Custom Prompt Templates

Create Jinja2 templates for your agent's prompts:

You are a specialized AI agent designed for {{INSTRUCTION}}.

{% if TOOLS_PROMPT is defined and TOOLS_PROMPT|length %}
{{TOOLS_PROMPT}}

When you need to use tools, respond with this JSON format:
{
    "server": "server-name",
    "tool": "tool-name", 
    "arguments": {"key": "value"}
}
{% endif %}

Follow these guidelines:
1. Be thorough in your analysis
2. Use tools when additional information is needed
3. Provide clear, actionable responses
4. If uncertain, ask clarifying questions

Step 4: Testing Your Custom Agent

Create tests for your agent:

import pytest
from mcpuniverse.agent.my_custom_agent import MyCustomAgent
from mcpuniverse.llm.manager import ModelManager
from mcpuniverse.mcp.manager import MCPManager

@pytest.mark.asyncio
async def test_custom_agent():
    # Setup    
    agent = MyCustomAgent(
        mcp_manager=MCPManager(),
        llm=ModelManager().build_model(name="openai"),
        config={"name": "test-agent", "instruction": "Test agent"}
    )
    
    # Test initialization
    await agent.initialize()
    # Test execution
    response = await agent.execute(message="Hello, world!")
    assert response.name == "test-agent"
    assert isinstance(response.response, str)
    # Cleanup
    await agent.cleanup()

Best Practices

  1. Configuration Management: Use YAML files for configuration and support environment variable substitution
  2. Error Handling: Implement comprehensive error handling with meaningful error messages
  3. Logging: Use the framework's logging system for debugging and monitoring
  4. Resource Cleanup: Always implement proper cleanup in the _cleanup method
  5. Memory Management: Consider memory usage for long-running agents
  6. Testing: Write comprehensive tests for your agent's functionality
  7. Documentation: Document your agent's capabilities, configuration options, and usage examples

Custom Evaluators Implementation

This guide provides comprehensive documentation for implementing custom evaluators in MCP-Universe. Evaluators are essential components that assess agent performance against specific criteria and validation rules.

Evaluator System Overview

The evaluator system consists of two main function types:

  • Evaluation Functions: Transform and extract data from agent responses
  • Comparison Functions: Compare processed data against expected values

Implementation Steps

Step 1: Create Module Structure

mkdir mcpuniverse/evaluator/my_domain
touch mcpuniverse/evaluator/my_domain/__init__.py
touch mcpuniverse/evaluator/my_domain/functions.py

Step 2: Implement Evaluation Functions

import re

from mcpuniverse.evaluator.functions import eval_func, FunctionResult

@eval_func(name="extract_score")
async def extract_score(x: FunctionResult, *args, **kwargs) -> FunctionResult:
    """Extract a numerical score from an agent response."""
    if isinstance(x, FunctionResult):
        data = x.result
        if isinstance(data, dict) and 'score' in data:
            return FunctionResult(result=float(data['score']))
        if isinstance(data, str):
            # Try to extract the first number from the string
            match = re.search(r'\d+\.?\d*', data)
            if match:
                return FunctionResult(result=float(match.group()))
    raise ValueError("Could not extract score from input")

Step 3: Implement Comparison Functions

from typing import Any

from mcpuniverse.evaluator.functions import compare_func, FunctionResult

@compare_func(name="score_threshold")
async def score_threshold(a: Any, b: Any, *args, **kwargs) -> tuple[bool, str]:
    """Check if score meets threshold."""
    if isinstance(a, FunctionResult):
        a = a.result
    if isinstance(b, FunctionResult):
        b = b.result
    
    threshold = float(b)
    score = float(a)
    
    if score >= threshold:
        return True, ""
    return False, f"Score {score} below threshold {threshold}"

Built-in Evaluation Functions

  • json: Parse a JSON string. Example: "json"
  • get(key): Extract a dictionary value. Example: "json -> get(city)"
  • len: Get array/string length. Example: "json -> get(items) -> len"
  • foreach: Iterate over arrays. Example: "json -> get(routes) -> foreach -> get(name)"

Task Configuration Example

{
    "category": "ecommerce",
    "question": "Calculate the final price for a shopping cart with discount",
    "evaluators": [
        {
            "func": "json -> extract_score",
            "op": "score_threshold",
            "value": 0.8,
            "op_args": {
                "threshold": 0.8
            }
        }
    ]
}