Overview
RepoMaster has been rigorously evaluated on multiple benchmarks to demonstrate its effectiveness in autonomous repository exploration and task execution. Our evaluation covers two major benchmarks that test different aspects of the system:
- GitTaskBench: Repository-level benchmark for real-world coding tasks
- MLE-Bench: Machine learning engineering benchmark from OpenAI
Key Performance Highlights
RepoMaster achieves state-of-the-art performance across multiple metrics:
GitTaskBench
- 75.92% Execution Rate
- 62.96% Task Pass Rate
- 154k Average Token Usage
MLE-Bench
- 95.45% Valid Submissions
- 27.27% Medal Rate
- 22.73% Gold Medals
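The rate metrics above are proportions over a benchmark's tasks. As a minimal sketch, assuming that Execution Rate counts tasks whose run completed and Task Pass Rate counts tasks whose output met the task's success criterion (this page does not define either metric), the percentages could be computed as follows. The `TaskResult` record, its fields, and the sample data are hypothetical; they are not taken from the RepoMaster codebase or its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task outcome record (not RepoMaster's actual schema)."""
    task_id: str
    executed: bool  # assumed meaning: the agent's run finished and produced output
    passed: bool    # assumed meaning: the output met the task's success criterion


def percentage(count: int, total: int) -> float:
    """Return count / total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 2)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute execution and pass rates over all tasks in the benchmark."""
    total = len(results)
    executed = sum(r.executed for r in results)
    # A task only counts as passed if it also executed.
    passed = sum(r.executed and r.passed for r in results)
    return {
        "execution_rate": percentage(executed, total),
        "task_pass_rate": percentage(passed, total),
    }


# Made-up sample: 4 tasks, 3 executed, 2 passed.
sample = [
    TaskResult("t1", executed=True, passed=True),
    TaskResult("t2", executed=True, passed=True),
    TaskResult("t3", executed=True, passed=False),
    TaskResult("t4", executed=False, passed=False),
]
print(summarize(sample))  # {'execution_rate': 75.0, 'task_pass_rate': 50.0}
```

Under these assumptions both rates share the same denominator (all tasks), so the pass rate can never exceed the execution rate.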
Why These Benchmarks Matter
GitTaskBench
GitTaskBench is a comprehensive repository-level benchmark designed to evaluate AI agents on real-world coding tasks. It tests an agent's ability to:
- Discover relevant repositories from GitHub
- Understand complex codebases with minimal documentation
- Execute tasks that require repository-level understanding
- Adapt to diverse programming languages and frameworks
MLE-Bench
MLE-Bench, created by OpenAI, evaluates machine learning engineering capabilities through Kaggle-style competitions. It assesses:
- Data processing and feature engineering skills
- Model development and experimentation
- Submission generation and validation
- End-to-end ML pipeline execution
Evaluation Methodology
Our benchmark evaluation follows rigorous standards:
- Reproducibility: All experiments can be reproduced using configurations in configs/
- Fair Comparison: We compare against published baselines using identical setups
- Comprehensive Metrics: We report multiple metrics to provide a complete picture
- Transparency: Full results and analysis are available in our research paper
The benchmarking code and configurations are available in the RepoMaster repository. To run the benchmarks yourself, see Run Benchmarks under Next Steps.
Comparison with Baselines
RepoMaster outperforms the baseline frameworks on each GitTaskBench metric for which a baseline figure is listed (a dash marks a value not listed here):
| Framework | GitTaskBench Execution Rate | GitTaskBench Pass Rate |
|---|---|---|
| RepoMaster | 75.92% | 62.96% |
| OpenHands | - | 24.07% |
| SWE-Agent | 44.44% | - |
Token Efficiency: RepoMaster achieves a 95% reduction in token usage compared to existing frameworks, averaging only 154k tokens per task. This dramatic improvement in efficiency makes RepoMaster practical for real-world deployment.
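To make the efficiency claim concrete, the arithmetic behind a percentage reduction can be sketched as follows. This assumes the 95% figure compares per-task averages against a single baseline average. The page does not state the baseline's token count, so the value computed here is only what the two quoted numbers would imply under that assumption, not a reported result. The function names are illustrative.

```python
def reduction(baseline_tokens: float, new_tokens: float) -> float:
    """Fractional reduction in token usage relative to a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - new_tokens / baseline_tokens


def implied_baseline(new_tokens: float, reduction_fraction: float) -> float:
    """Baseline usage implied by a given usage and fractional reduction."""
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return new_tokens / (1.0 - reduction_fraction)


# Figures quoted above: 154k tokens per task, 95% reduction.
baseline = implied_baseline(154_000, 0.95)
print(f"implied baseline: about {baseline / 1e6:.2f}M tokens per task")
```

Read this way, a 95% reduction means RepoMaster uses one twentieth of the baseline's tokens per task.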
Research Publication
Our work has been accepted to NeurIPS 2025 as a Spotlight paper (top 3.2% of submissions):
RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories
Authors: Huacan Wang, Ziyi Ni, Shuo Zhang, Lu Shuo, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Ronghao Chen, Xin Li, Daxin Jiang, Yuntao Du, Pin Lyu
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2505.21577
Next Steps
- GitTaskBench Details: Explore detailed GitTaskBench results and analysis
- MLE-Bench Results: View MLE-Bench evaluation and methodology
- Performance Analysis: Deep dive into performance metrics and comparisons
- Run Benchmarks: Learn how to run benchmarks yourself