Contribute to MCP-Universe

Help us expand and improve the MCP-Universe benchmark by contributing new tasks, evaluators, and enhancements. Your contributions will help advance AI agent evaluation and benefit the entire research community.

Ways to Contribute

There are many ways you can contribute to the MCP-Universe project. Whether you're a researcher, developer, or domain expert, your contributions are valuable.

New Benchmark Tasks

Create new evaluation tasks that test different aspects of AI agent capabilities. Tasks should be challenging, realistic, and cover diverse scenarios within existing or new domains.

  • Design tasks that reflect real-world use cases
  • Ensure tasks are challenging but fair
  • Include comprehensive evaluation criteria
  • Provide clear expected outputs and formats

Custom Evaluators

Develop specialized evaluators that can assess agent performance on specific types of tasks. Evaluators should be robust, reliable, and provide meaningful feedback.

  • Static evaluators for time-invariant content
  • Dynamic evaluators for real-time data validation
  • Format evaluators for output structure compliance
  • Domain-specific evaluation logic

MCP Server Integration

Integrate new MCP servers to expand the range of tools and services available to agents during evaluation.

  • Identify useful external services and APIs
  • Create MCP server implementations (see the sketch after this list)
  • Ensure proper authentication and rate limiting
  • Document server capabilities and limitations
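
One common way to implement a new server is the FastMCP helper from the official MCP Python SDK (installed with pip install mcp). The sketch below is a minimal, generic example rather than an MCP-Universe-specific integration; the server name, tool, and behavior are placeholders:

# Minimal MCP server sketch using the official MCP Python SDK.
# The server name and tool below are illustrative placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-weather")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a placeholder temperature reading for a city."""
    # A real server would call an external API here, with proper
    # authentication and rate limiting.
    return f"Temperature data for {city} is not wired up in this sketch."

if __name__ == "__main__":
    # stdio is the most common transport for MCP clients launching servers.
    mcp.run(transport="stdio")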

Bug Reports & Improvements

Help improve the framework by reporting bugs, suggesting enhancements, and contributing code improvements.

  • Report issues with detailed reproduction steps
  • Suggest performance optimizations
  • Improve documentation and examples
  • Enhance user experience and accessibility

Contribution Guidelines

Before You Start
Please read our Contributing Guide and Code of Conduct before making your first contribution.

Quality Standards

  • Clarity: Tasks and evaluators should have clear, unambiguous specifications
  • Reproducibility: All contributions should be reproducible across different environments
  • Documentation: Provide comprehensive documentation for new features
  • Testing: Include appropriate tests for new functionality
  • Performance: Consider the performance impact of new additions

Review Process

All contributions go through a review process to ensure quality and consistency:

  1. Submit a pull request with your changes
  2. Automated tests and linting checks will run
  3. Core maintainers will review your contribution
  4. Address any feedback or requested changes
  5. Once approved, your contribution will be merged

Contributing Tasks

Task Structure

Each task is defined as a JSON file with the following structure:

{
    "category": "general",
    "question": "Your task description here...",
    "output_format": {
        "answer": "[Expected output format]"
    },
    "use_specified_server": true,
    "mcp_servers": [
        {
            "name": "server-name"
        }
    ],
    "evaluators": [
        {
            "func": "evaluation_function",
            "op": "operation",
            "value": "expected_value"
        }
    ]
}

Key Components

  • category: Classification of the task type
  • question: The actual task description or prompt
  • output_format: Expected structure of the agent's response
  • use_specified_server: Whether the agent is restricted to the MCP servers listed in mcp_servers
  • mcp_servers: List of MCP servers the agent can use
  • evaluators: Evaluation criteria and functions
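
Putting these components together, a complete task file might look like the following; the question, server, and expected key are purely illustrative:

{
    "category": "general",
    "question": "Find the current population of Tokyo and report it as an integer.",
    "output_format": {
        "answer": "[Population as an integer]"
    },
    "use_specified_server": true,
    "mcp_servers": [
        {"name": "google-search"}
    ],
    "evaluators": [
        {
            "func": "json",
            "op": "contains_key",
            "value": "answer"
        }
    ]
}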

Creating Tasks

Design the Task

Start by identifying a real-world scenario that would benefit from AI agent automation. Consider the following:

  • What domain does this task belong to?
  • What tools or services would an agent need?
  • What makes this task challenging?
  • How can success be measured objectively?

Write the Task Description

Create a clear, detailed description of what the agent needs to accomplish:

  • Use natural language that a human would understand
  • Include all necessary context and constraints
  • Specify the expected output format
  • Make the task challenging but achievable
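
For example, an illustrative task description might read:

"question": "Using the available search tools, find the three most recent stable releases of the Python programming language and report their version numbers in descending order, separated by commas."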

Define Output Format

Specify the exact structure you expect from the agent's response:

"output_format": {
    "result": "[Main answer]",
    "confidence": "[Confidence level]",
    "sources": ["[List of sources used]"]
}

Select MCP Servers

Choose which MCP servers the agent should have access to:

"mcp_servers": [
    {"name": "google-search"},
    {"name": "fetch"},
    {"name": "calculator"}
]

Design Evaluators

Create evaluation criteria that can automatically assess the agent's performance:

"evaluators": [
    {
        "func": "json",
        "op": "contains_key",
        "value": "result"
    },
    {
        "func": "custom_evaluator",
        "op": "validate_answer",
        "op_args": {"expected": "correct_answer"}
    }
]

Task Validation

Testing Required
All new tasks must be tested with at least one agent to ensure they are solvable and the evaluators work correctly.

Validation Checklist

  • Task description is clear and unambiguous
  • Output format is well-defined and reasonable
  • Required MCP servers are available and functional
  • Evaluators correctly assess agent performance
  • Task is challenging but solvable
  • JSON syntax is valid and follows schema
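
A quick way to catch syntax problems before submitting is to load the task file with Python's json module. The sketch below also checks for the top-level keys described earlier; the required-key list mirrors the example structure above and is not an official schema:

import json
import sys

# Top-level keys taken from the task structure shown above (illustrative, not an official schema).
REQUIRED_KEYS = ["category", "question", "output_format", "mcp_servers", "evaluators"]

def check_task_file(path):
    """Fail loudly if the task file is not valid JSON or is missing expected keys."""
    with open(path, "r", encoding="utf-8") as f:
        task = json.load(f)  # raises json.JSONDecodeError on invalid syntax
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise SystemExit(f"{path} is missing keys: {missing}")
    print(f"{path} looks structurally valid.")

if __name__ == "__main__":
    check_task_file(sys.argv[1])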

Testing Your Task

python -m mcpuniverse.benchmark.runner \
    --config your_task_config.yaml \
    --task path/to/your/task.json \
    --agent test-agent

Contributing Evaluators

Types of Evaluators

Format Evaluators

Validate that agent outputs conform to expected formats (JSON, XML, specific schemas).

{
    "func": "json",
    "op": "validate_schema",
    "op_args": {"schema": "path/to/schema.json"}
}

Static Evaluators

Compare agent outputs against known, time-invariant correct answers.

{
    "func": "static_match",
    "op": "exact_match",
    "value": "expected_answer"
}

Dynamic Evaluators

Fetch real-time ground truth data to validate time-sensitive responses.

{
    "func": "dynamic_check",
    "op": "api_validation",
    "op_args": {"endpoint": "https://api.example.com/validate"}
}

Creating Custom Evaluators

Evaluator Function Template

def custom_evaluator(output, expected, **kwargs):
    """
    Custom evaluator function.

    Args:
        output: Agent's output to evaluate
        expected: Expected correct output
        **kwargs: Additional parameters (e.g., a passing "threshold")

    Returns:
        dict: Evaluation result with score and details
    """
    threshold = kwargs.get("threshold", 1.0)
    try:
        # Your evaluation logic here. This placeholder treats non-string
        # output as malformed and scores by exact match; replace it with
        # a metric suited to your task.
        if not isinstance(output, str):
            raise TypeError("output must be a string")
        score = 1.0 if output.strip() == str(expected).strip() else 0.0

        return {
            "score": score,
            "passed": score >= threshold,
            "details": "Evaluation details",
            "error": None
        }
    except Exception as e:
        return {
            "score": 0.0,
            "passed": False,
            "details": "Evaluation failed",
            "error": str(e)
        }

Best Practices

  • Handle edge cases and malformed inputs gracefully
  • Provide detailed feedback in the evaluation results
  • Use appropriate scoring metrics for your domain
  • Include comprehensive error handling
  • Document your evaluator's behavior and limitations
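
As a concrete illustration of these practices, here is a sketch of a numeric evaluator that tolerates small differences and fails gracefully on malformed input. The function name, default tolerance, and result format are illustrative and simply mirror the template above:

def numeric_tolerance_evaluator(output, expected, **kwargs):
    """Pass if the agent's numeric answer is within a relative tolerance of the expected value."""
    tolerance = kwargs.get("tolerance", 0.01)  # 1% relative tolerance by default
    try:
        predicted = float(output)
        target = float(expected)
        error = abs(predicted - target) / max(abs(target), 1e-12)
        return {
            "score": max(0.0, 1.0 - error),
            "passed": error <= tolerance,
            "details": f"Relative error: {error:.4f}",
            "error": None
        }
    except (TypeError, ValueError) as e:
        # Non-numeric or missing output fails gracefully instead of raising.
        return {
            "score": 0.0,
            "passed": False,
            "details": "Could not parse a numeric answer",
            "error": str(e)
        }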

Testing Evaluators

Unit Testing

def test_custom_evaluator():
    # Test correct output
    result = custom_evaluator("correct answer", "correct answer")
    assert result["passed"] is True
    assert result["score"] == 1.0

    # Test incorrect output
    result = custom_evaluator("wrong answer", "correct answer")
    assert result["passed"] is False
    assert result["score"] < 1.0

    # Test malformed input
    result = custom_evaluator(None, "correct answer")
    assert result["error"] is not None

Development Setup

Environment Setup

# Clone the repository
git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
cd MCP-Universe

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -r dev-requirements.txt

# Install pre-commit hooks
pre-commit install

Development Workflow

  1. Create a new branch for your feature
  2. Make your changes following the coding standards
  3. Write tests for new functionality
  4. Run the test suite to ensure nothing is broken
  5. Submit a pull request with a clear description

Testing

Running Tests

# Run all tests
python -m pytest

# Run specific test file
python -m pytest tests/test_evaluators.py

# Run with coverage
python -m pytest --cov=mcpuniverse

Test Categories

  • Unit Tests: Test individual functions and classes
  • Integration Tests: Test component interactions
  • End-to-End Tests: Test complete workflows
  • Performance Tests: Ensure acceptable performance

Submission Process

Prepare Your Contribution

  • Ensure your code follows the project's style guidelines
  • Write comprehensive tests for new functionality
  • Update documentation as needed
  • Test your changes thoroughly

Create Pull Request

  • Fork the repository and create a feature branch
  • Commit your changes with clear, descriptive messages
  • Push to your fork and create a pull request
  • Fill out the pull request template completely

Review Process

  • Automated checks will run on your pull request
  • Core maintainers will review your code
  • Address any feedback or requested changes
  • Once approved, your contribution will be merged

Ready to Contribute?
Join our Discord community to discuss ideas, get help, and connect with other contributors. We're excited to see what you'll build!