Overview

RepoMaster has been rigorously evaluated on multiple benchmarks to demonstrate its effectiveness in autonomous repository exploration and task execution. Our evaluation covers two major benchmarks that test different aspects of the system:
  • GitTaskBench: Repository-level benchmark for real-world coding tasks
  • MLE-Bench: Machine learning engineering benchmark from OpenAI
These benchmarks validate RepoMaster’s ability to understand, explore, and execute complex tasks using GitHub repositories.

Key Performance Highlights

RepoMaster achieves state-of-the-art performance across multiple metrics:

GitTaskBench

  • 75.92% Execution Rate
  • 62.96% Task Pass Rate
  • 154k Average Token Usage

MLE-Bench

  • 95.45% Valid Submissions
  • 27.27% Medal Rate
  • 22.73% Gold Medals

Why These Benchmarks Matter

GitTaskBench

GitTaskBench is a comprehensive repository-level benchmark designed to evaluate AI agents on real-world coding tasks. It tests an agent’s ability to:
  • Discover relevant repositories from GitHub
  • Understand complex codebases with minimal documentation
  • Execute tasks that require repository-level understanding
  • Adapt to diverse programming languages and frameworks
RepoMaster’s performance on GitTaskBench demonstrates its practical applicability to real-world software engineering tasks.

MLE-Bench

MLE-Bench, created by OpenAI, evaluates machine learning engineering capabilities through Kaggle-style competitions. It assesses:
  • Data processing and feature engineering skills
  • Model development and experimentation
  • Submission generation and validation
  • End-to-end ML pipeline execution
Our strong performance on MLE-Bench shows RepoMaster’s ability to handle complex ML workflows autonomously.

Evaluation Methodology

Our benchmark evaluation follows rigorous standards:
  1. Reproducibility: All experiments can be reproduced using configurations in configs/
  2. Fair Comparison: We compare against published baselines using identical setups
  3. Comprehensive Metrics: We report multiple metrics to provide a complete picture
  4. Transparency: Full results and analysis are available in our research paper
The benchmarking code and configurations are available in the RepoMaster repository. You can run the benchmarks yourself using:

```bash
python -m src.core.git_task --config configs/benchmark_config.yaml
```

Comparison with Baselines

RepoMaster significantly outperforms existing frameworks:
| Framework | GitTaskBench Execution Rate | GitTaskBench Pass Rate |
| --- | --- | --- |
| RepoMaster | 75.92% | 62.96% |
| OpenHands | - | 24.07% |
| SWE-Agent | 44.44% | - |
Token Efficiency: RepoMaster achieves a 95% reduction in token usage compared to existing frameworks, averaging only 154k tokens per task. This dramatic improvement in efficiency makes RepoMaster practical for real-world deployment.
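To make the efficiency claim concrete, the implied baseline can be back-calculated: if 154k tokens represents a 95% reduction, the comparison frameworks average roughly 3M tokens per task. A minimal sketch of that arithmetic, assuming the 95% figure is measured relative to the baseline average:

```python
# Back-calculate the implied baseline token usage from the reported numbers.
# Assumption (not stated in the doc): the 95% reduction is relative to the
# baseline frameworks' average token usage per task.
repomaster_tokens = 154_000
reduction = 0.95

# reduction = 1 - (repomaster / baseline)  =>  baseline = repomaster / (1 - reduction)
baseline_tokens = repomaster_tokens / (1 - reduction)

print(f"Implied baseline average: {baseline_tokens / 1e6:.2f}M tokens per task")
# → Implied baseline average: 3.08M tokens per task
```

This is only an order-of-magnitude sanity check; the actual baseline figures are reported in the research paper.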

Research Publication

Our work has been accepted to NeurIPS 2025 as a Spotlight paper (top 3.2% of submissions):

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories

Authors: Huacan Wang, Ziyi Ni, Shuo Zhang, Lu Shuo, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Ronghao Chen, Xin Li, Daxin Jiang, Yuntao Du, Pin Lyu
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2505.21577

Next Steps

GitTaskBench Details

Explore detailed GitTaskBench results and analysis

MLE-Bench Results

View MLE-Bench evaluation and methodology

Performance Analysis

Deep dive into performance metrics and comparisons

Run Benchmarks

Learn how to run benchmarks yourself
