Contribute to MCP-Universe

Help us expand and improve the MCP-Universe benchmark by contributing new tasks, evaluators, and enhancements. Your contributions will help advance AI agent evaluation and benefit the entire research community.

Ways to Contribute

There are many ways you can contribute to the MCP-Universe project. Whether you're a researcher, developer, or domain expert, your contributions are valuable.

New Benchmark Tasks

Create new evaluation tasks that test different aspects of AI agent capabilities. Tasks should be challenging, realistic, and cover diverse scenarios within existing or new domains.

  • Design tasks that reflect real-world use cases
  • Ensure tasks are challenging but fair
  • Include comprehensive evaluation criteria
  • Provide clear expected outputs and formats

Custom Evaluators

Develop specialized evaluators that can assess agent performance on specific types of tasks. Evaluators should be robust, reliable, and provide meaningful feedback.

  • Static evaluators for time-invariant content
  • Dynamic evaluators for real-time data validation
  • Format evaluators for output structure compliance
  • Domain-specific evaluation logic

MCP Server Integration

Integrate new MCP servers to expand the range of tools and services available to agents during evaluation.

  • Identify useful external services and APIs
  • Create MCP server implementations (see the sketch after this list)
  • Ensure proper authentication and rate limiting
  • Document server capabilities and limitations
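
One common way to implement a new server is the FastMCP helper from the official MCP Python SDK (installed with pip install mcp). The sketch below is a minimal, generic example rather than an MCP-Universe-specific integration; the server name, tool, and behavior are placeholders:

# Minimal MCP server sketch using the official MCP Python SDK.
# The server name and tool below are illustrative placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-weather")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a placeholder temperature reading for a city."""
    # A real server would call an external API here, with proper
    # authentication and rate limiting.
    return f"Temperature data for {city} is not wired up in this sketch."

if __name__ == "__main__":
    # stdio is the most common transport for MCP clients launching servers.
    mcp.run(transport="stdio")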

Bug Reports & Improvements

Help improve the framework by reporting bugs, suggesting enhancements, and contributing code improvements.

  • Report issues with detailed reproduction steps
  • Suggest performance optimizations
  • Improve documentation and examples
  • Enhance user experience and accessibility

Contribution Guidelines

Before You Start
Please read our Contributing Guide and Code of Conduct before making your first contribution.

Quality Standards

  • Clarity: Tasks and evaluators should have clear, unambiguous specifications
  • Reproducibility: All contributions should be reproducible across different environments
  • Documentation: Provide comprehensive documentation for new features
  • Testing: Include appropriate tests for new functionality
  • Performance: Consider the performance impact of new additions

Review Process

All contributions go through a review process to ensure quality and consistency:

  1. Submit a pull request with your changes
  2. Automated tests and linting checks will run
  3. Core maintainers will review your contribution
  4. Address any feedback or requested changes
  5. Once approved, your contribution will be merged

Contributing Tasks

Task Structure

Each task is defined as a JSON file with the following structure:

{
    "category": "general",
    "question": "Your task description here...",
    "output_format": {
        "answer": "[Expected output format]"
    },
    "use_specified_server": true,
    "mcp_servers": [
        {
            "name": "server-name"
        }
    ],
    "evaluators": [
        {
            "func": "evaluation_function",
            "op": "operation",
            "value": "expected_value"
        }
    ]
}

Key Components

  • category: Classification of the task type
  • question: The actual task description or prompt
  • output_format: Expected structure of the agent's response
  • use_specified_server: Whether the agent is restricted to the MCP servers listed in mcp_servers
  • mcp_servers: List of MCP servers the agent can use
  • evaluators: Evaluation criteria and functions
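
Putting these components together, a complete task file might look like the following; the question, server, and expected key are purely illustrative:

{
    "category": "general",
    "question": "Find the current population of Tokyo and report it as an integer.",
    "output_format": {
        "answer": "[Population as an integer]"
    },
    "use_specified_server": true,
    "mcp_servers": [
        {"name": "google-search"}
    ],
    "evaluators": [
        {
            "func": "json",
            "op": "contains_key",
            "value": "answer"
        }
    ]
}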

Creating Tasks

Design the Task

Start by identifying a real-world scenario that would benefit from AI agent automation. Consider the following:

  • What domain does this task belong to?
  • What tools or services would an agent need?
  • What makes this task challenging?
  • How can success be measured objectively?

Write the Task Description

Create a clear, detailed description of what the agent needs to accomplish:

  • Use natural language that a human would understand
  • Include all necessary context and constraints
  • Specify the expected output format
  • Make the task challenging but achievable
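
For example, an illustrative task description might read:

"question": "Using the available search tools, find the three most recent stable releases of the Python programming language and report their version numbers in descending order, separated by commas."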

Define Output Format

Specify the exact structure you expect from the agent's response:

"output_format": {
    "result": "[Main answer]",
    "confidence": "[Confidence level]",
    "sources": ["[List of sources used]"]
}

Select MCP Servers

Choose which MCP servers the agent should have access to:

"mcp_servers": [
    {"name": "google-search"},
    {"name": "fetch"},
    {"name": "calculator"}
]

Design Evaluators

Create evaluation criteria that can automatically assess the agent's performance:

"evaluators": [
    {
        "func": "json",
        "op": "contains_key",
        "value": "result"
    },
    {
        "func": "custom_evaluator",
        "op": "validate_answer",
        "op_args": {"expected": "correct_answer"}
    }
]

Task Validation

Testing Required
All new tasks must be tested with at least one agent to ensure they are solvable and the evaluators work correctly.

Validation Checklist

  • Task description is clear and unambiguous
  • Output format is well-defined and reasonable
  • Required MCP servers are available and functional
  • Evaluators correctly assess agent performance
  • Task is challenging but solvable
  • JSON syntax is valid and follows schema
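
A quick way to catch syntax problems before submitting is to load the task file with Python's json module. The sketch below also checks for the top-level keys described earlier; the required-key list mirrors the example structure above and is not an official schema:

import json
import sys

# Top-level keys taken from the task structure shown above (illustrative, not an official schema).
REQUIRED_KEYS = ["category", "question", "output_format", "mcp_servers", "evaluators"]

def check_task_file(path):
    """Fail loudly if the task file is not valid JSON or is missing expected keys."""
    with open(path, "r", encoding="utf-8") as f:
        task = json.load(f)  # raises json.JSONDecodeError on invalid syntax
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise SystemExit(f"{path} is missing keys: {missing}")
    print(f"{path} looks structurally valid.")

if __name__ == "__main__":
    check_task_file(sys.argv[1])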

Testing Your Task

python -m mcpuniverse.benchmark.runner \
    --config your_task_config.yaml \
    --task path/to/your/task.json \
    --agent test-agent

Contributing Evaluators

Types of Evaluators

Format Evaluators

Validate that agent outputs conform to expected formats (JSON, XML, specific schemas).

{
    "func": "json",
    "op": "validate_schema",
    "op_args": {"schema": "path/to/schema.json"}
}

Static Evaluators

Compare agent outputs against known, time-invariant correct answers.

{
    "func": "static_match",
    "op": "exact_match",
    "value": "expected_answer"
}

Dynamic Evaluators

Fetch real-time ground truth data to validate time-sensitive responses.

{
    "func": "dynamic_check",
    "op": "api_validation",
    "op_args": {"endpoint": "https://api.example.com/validate"}
}

Creating Custom Evaluators

Evaluator Function Template

def custom_evaluator(output, expected, **kwargs):
    """
    Custom evaluator function.

    Args:
        output: Agent's output to evaluate
        expected: Expected correct output
        **kwargs: Additional parameters (e.g., a passing "threshold")

    Returns:
        dict: Evaluation result with score and details
    """
    threshold = kwargs.get("threshold", 1.0)
    try:
        # Your evaluation logic here. This placeholder treats non-string
        # output as malformed and scores by exact match; replace it with
        # a metric suited to your task.
        if not isinstance(output, str):
            raise TypeError("output must be a string")
        score = 1.0 if output.strip() == str(expected).strip() else 0.0

        return {
            "score": score,
            "passed": score >= threshold,
            "details": "Evaluation details",
            "error": None
        }
    except Exception as e:
        return {
            "score": 0.0,
            "passed": False,
            "details": "Evaluation failed",
            "error": str(e)
        }

Best Practices

  • Handle edge cases and malformed inputs gracefully
  • Provide detailed feedback in the evaluation results
  • Use appropriate scoring metrics for your domain
  • Include comprehensive error handling
  • Document your evaluator's behavior and limitations
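
As a concrete illustration of these practices, here is a sketch of a numeric evaluator that tolerates small differences and fails gracefully on malformed input. The function name, default tolerance, and result format are illustrative and simply mirror the template above:

def numeric_tolerance_evaluator(output, expected, **kwargs):
    """Pass if the agent's numeric answer is within a relative tolerance of the expected value."""
    tolerance = kwargs.get("tolerance", 0.01)  # 1% relative tolerance by default
    try:
        predicted = float(output)
        target = float(expected)
        error = abs(predicted - target) / max(abs(target), 1e-12)
        return {
            "score": max(0.0, 1.0 - error),
            "passed": error <= tolerance,
            "details": f"Relative error: {error:.4f}",
            "error": None
        }
    except (TypeError, ValueError) as e:
        # Non-numeric or missing output fails gracefully instead of raising.
        return {
            "score": 0.0,
            "passed": False,
            "details": "Could not parse a numeric answer",
            "error": str(e)
        }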

Testing Evaluators

Unit Testing

def test_custom_evaluator():
    # Test correct output
    result = custom_evaluator("correct answer", "correct answer")
    assert result["passed"] is True
    assert result["score"] == 1.0

    # Test incorrect output
    result = custom_evaluator("wrong answer", "correct answer")
    assert result["passed"] is False
    assert result["score"] < 1.0

    # Test malformed input
    result = custom_evaluator(None, "correct answer")
    assert result["error"] is not None

Development Setup

Environment Setup

# Clone the repository
git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
cd MCP-Universe

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r dev-requirements.txt

# Install pre-commit hooks
pre-commit install

Development Workflow

  1. Create a new branch for your feature
  2. Make your changes following the coding standards
  3. Write tests for new functionality
  4. Run the test suite to ensure nothing is broken
  5. Submit a pull request with a clear description

Testing

Running Tests

# Run all tests
python -m pytest

# Run specific test file
python -m pytest tests/test_evaluators.py

# Run with coverage
python -m pytest --cov=mcpuniverse

Test Categories

  • Unit Tests: Test individual functions and classes
  • Integration Tests: Test component interactions
  • End-to-End Tests: Test complete workflows
  • Performance Tests: Ensure acceptable performance

Submission Process

Prepare Your Contribution

  • Ensure your code follows the project's style guidelines
  • Write comprehensive tests for new functionality
  • Update documentation as needed
  • Test your changes thoroughly

Create Pull Request

  • Fork the repository and create a feature branch
  • Commit your changes with clear, descriptive messages
  • Push to your fork and create a pull request
  • Fill out the pull request template completely

Review Process

  • Automated checks will run on your pull request
  • Core maintainers will review your code
  • Address any feedback or requested changes
  • Once approved, your contribution will be merged

Ready to Contribute?
Join our Discord community to discuss ideas, get help, and connect with other contributors. We're excited to see what you'll build!