Contribute to MCP-Universe
Help us expand and improve the MCP-Universe benchmark by contributing new tasks, evaluators, and enhancements. Your work helps advance AI agent evaluation and benefits the entire research community.
Ways to Contribute
There are many ways you can contribute to the MCP-Universe project. Whether you're a researcher, developer, or domain expert, your contributions are valuable.
New Benchmark Tasks
Create new evaluation tasks that test different aspects of AI agent capabilities. Tasks should be challenging, realistic, and cover diverse scenarios within existing or new domains.
- Design tasks that reflect real-world use cases
- Ensure tasks are challenging but fair
- Include comprehensive evaluation criteria
- Provide clear expected outputs and formats
Custom Evaluators
Develop specialized evaluators that can assess agent performance on specific types of tasks. Evaluators should be robust, reliable, and provide meaningful feedback.
- Static evaluators for time-invariant content
- Dynamic evaluators for real-time data validation
- Format evaluators for output structure compliance
- Domain-specific evaluation logic
MCP Server Integration
Integrate new MCP servers to expand the range of tools and services available to agents during evaluation.
- Identify useful external services and APIs
- Create MCP server implementations (a minimal sketch follows this list)
- Ensure proper authentication and rate limiting
- Document server capabilities and limitations
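If you are prototyping a new server, the sketch below is one possible starting point. It assumes the official MCP Python SDK and its FastMCP helper; the server name and tool are hypothetical and are not part of MCP-Universe.
# Minimal MCP server sketch (assumes the official `mcp` Python SDK is installed).
# The server name and tool below are illustrative placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-weather")

@mcp.tool()
def get_temperature(city: str) -> str:
    """Return a placeholder temperature reading for a city."""
    # A real integration would call an external API here, with proper
    # authentication and rate limiting.
    return f"Temperature lookup for {city} is not implemented yet."

if __name__ == "__main__":
    # Serve over stdio so an MCP client can launch this as a subprocess.
    mcp.run()
Whichever SDK you use, document each tool's behavior, authentication requirements, and rate limits so task authors know when the server is appropriate.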
Bug Reports & Improvements
Help improve the framework by reporting bugs, suggesting enhancements, and contributing code improvements.
- Report issues with detailed reproduction steps
- Suggest performance optimizations
- Improve documentation and examples
- Enhance user experience and accessibility
Contribution Guidelines
Quality Standards
- Clarity: Tasks and evaluators should have clear, unambiguous specifications
- Reproducibility: All contributions should be reproducible across different environments
- Documentation: Provide comprehensive documentation for new features
- Testing: Include appropriate tests for new functionality
- Performance: Consider the performance impact of new additions
Review Process
All contributions go through a review process to ensure quality and consistency:
- Submit a pull request with your changes
- Automated tests and linting checks will run
- Core maintainers will review your contribution
- Address any feedback or requested changes
- Once approved, your contribution will be merged
Contributing Tasks
Task Structure
Each task is defined as a JSON file with the following structure:
{
  "category": "general",
  "question": "Your task description here...",
  "output_format": {
    "answer": "[Expected output format]"
  },
  "use_specified_server": true,
  "mcp_servers": [
    {
      "name": "server-name"
    }
  ],
  "evaluators": [
    {
      "func": "evaluation_function",
      "op": "operation",
      "value": "expected_value"
    }
  ]
}
Key Components
- category: Classification of the task type
- question: The actual task description or prompt
- output_format: Expected structure of the agent's response
- mcp_servers: List of MCP servers the agent can use
- evaluators: Evaluation criteria and functions
Creating Tasks
Design the Task
Start by identifying a real-world scenario that would benefit from AI agent automation. Consider the following:
- What domain does this task belong to?
- What tools or services would an agent need?
- What makes this task challenging?
- How can success be measured objectively?
Write the Task Description
Create a clear, detailed description of what the agent needs to accomplish:
- Use natural language that a human would understand
- Include all necessary context and constraints
- Specify the expected output format
- Make the task challenging but achievable
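For example, a hypothetical task description might read: "Find the three largest companies by market capitalization headquartered in Germany as of today, and report each company's name and market cap in billions of USD, rounded to one decimal place." Note how it pins down the scope, the quantities to report, and the formatting constraints.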
Define Output Format
Specify the exact structure you expect from the agent's response:
"output_format": {
"result": "[Main answer]",
"confidence": "[Confidence level]",
"sources": ["[List of sources used]"]
}
Select MCP Servers
Choose which MCP servers the agent should have access to:
"mcp_servers": [
{"name": "google-search"},
{"name": "fetch"},
{"name": "calculator"}
]
Design Evaluators
Create evaluation criteria that can automatically assess the agent's performance:
"evaluators": [
{
"func": "json",
"op": "contains_key",
"value": "result"
},
{
"func": "custom_evaluator",
"op": "validate_answer",
"op_args": {"expected": "correct_answer"}
}
]
Task Validation
Validation Checklist
- Task description is clear and unambiguous
- Output format is well-defined and reasonable
- Required MCP servers are available and functional
- Evaluators correctly assess agent performance
- Task is challenging but solvable
- JSON syntax is valid and follows schema (a quick check is sketched below)
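Before running the benchmark, a short script like the sketch below can catch malformed JSON and missing keys. This is only a sanity check, not an official validator: the required-key list is inferred from the example task structure above, and the file path is whatever you pass in.
# Sanity-check a task file: valid JSON plus the top-level keys used in the
# example task structure above. Adjust REQUIRED_KEYS to match your task.
import json
import sys

REQUIRED_KEYS = {"category", "question", "output_format", "mcp_servers", "evaluators"}

def check_task(path: str) -> None:
    with open(path, "r", encoding="utf-8") as f:
        task = json.load(f)  # raises json.JSONDecodeError on invalid syntax
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    print(f"{path}: basic structure looks OK")

if __name__ == "__main__":
    check_task(sys.argv[1])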
Testing Your Task
python -m mcpuniverse.benchmark.runner \
    --config your_task_config.yaml \
    --task path/to/your/task.json \
    --agent test-agent
Contributing Evaluators
Types of Evaluators
Format Evaluators
Validate that agent outputs conform to expected formats (JSON, XML, specific schemas).
{
  "func": "json",
  "op": "validate_schema",
  "op_args": {"schema": "path/to/schema.json"}
}
Static Evaluators
Compare agent outputs against known, time-invariant correct answers.
{
  "func": "static_match",
  "op": "exact_match",
  "value": "expected_answer"
}
Dynamic Evaluators
Fetch real-time ground truth data to validate time-sensitive responses.
{
  "func": "dynamic_check",
  "op": "api_validation",
  "op_args": {"endpoint": "https://api.example.com/validate"}
}
Creating Custom Evaluators
Evaluator Function Template
def custom_evaluator(output, expected, **kwargs):
    """
    Custom evaluator function.

    Args:
        output: Agent's output to evaluate
        expected: Expected correct output
        **kwargs: Additional parameters (e.g., a "threshold" for passing)

    Returns:
        dict: Evaluation result with score and details
    """
    threshold = kwargs.get("threshold", 1.0)
    try:
        # Your evaluation logic here; replace calculate_score with your
        # own comparison between the agent output and the expected answer.
        score = calculate_score(output, expected)
        return {
            "score": score,
            "passed": score >= threshold,
            "details": "Evaluation details",
            "error": None
        }
    except Exception as e:
        return {
            "score": 0.0,
            "passed": False,
            "details": "Evaluation failed",
            "error": str(e)
        }
Best Practices
- Handle edge cases and malformed inputs gracefully
- Provide detailed feedback in the evaluation results
- Use appropriate scoring metrics for your domain (see the sketch after this list)
- Include comprehensive error handling
- Document your evaluator's behavior and limitations
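As one concrete illustration of domain-appropriate scoring and graceful handling of malformed input, the sketch below scores numeric answers by relative error rather than exact string match. The function name and default tolerance are illustrative choices, not part of the framework.
# Illustrative numeric evaluator: scores by relative error instead of exact match.
def numeric_tolerance_evaluator(output, expected, tolerance=0.01, **kwargs):
    try:
        actual = float(output)
        target = float(expected)
    except (TypeError, ValueError) as e:
        # Malformed input is reported in the result rather than raised.
        return {"score": 0.0, "passed": False, "details": "Non-numeric output", "error": str(e)}
    denom = abs(target) if target != 0 else 1.0
    rel_error = abs(actual - target) / denom
    return {
        "score": max(0.0, 1.0 - rel_error),
        "passed": rel_error <= tolerance,
        "details": f"Relative error {rel_error:.4f} (tolerance {tolerance})",
        "error": None,
    }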
Testing Evaluators
Unit Testing
def test_custom_evaluator():
    # Test correct output
    result = custom_evaluator("correct answer", "correct answer")
    assert result["passed"] is True
    assert result["score"] == 1.0

    # Test incorrect output
    result = custom_evaluator("wrong answer", "correct answer")
    assert result["passed"] is False
    assert result["score"] < 1.0

    # Test malformed input
    result = custom_evaluator(None, "correct answer")
    assert result["error"] is not None
Development Setup
Environment Setup
# Clone the repository
git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
cd MCP-Universe
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -r dev-requirements.txt
# Install pre-commit hooks
pre-commit install
Development Workflow
- Create a new branch for your feature
- Make your changes following the coding standards
- Write tests for new functionality
- Run the test suite to ensure nothing is broken
- Submit a pull request with a clear description
Testing
Running Tests
# Run all tests
python -m pytest
# Run specific test file
python -m pytest tests/test_evaluators.py
# Run with coverage
python -m pytest --cov=mcpuniverse
Test Categories
- Unit Tests: Test individual functions and classes
- Integration Tests: Test component interactions
- End-to-End Tests: Test complete workflows
- Performance Tests: Ensure acceptable performance
Submission Process
Prepare Your Contribution
- Ensure your code follows the project's style guidelines
- Write comprehensive tests for new functionality
- Update documentation as needed
- Test your changes thoroughly
Create Pull Request
- Fork the repository and create a feature branch
- Commit your changes with clear, descriptive messages
- Push to your fork and create a pull request
- Fill out the pull request template completely
Review Process
- Automated checks will run on your pull request
- Core maintainers will review your code
- Address any feedback or requested changes
- Once approved, your contribution will be merged