Overview

RepoMaster has been rigorously evaluated on multiple benchmarks to demonstrate its effectiveness in autonomous repository exploration and task execution. Our evaluation covers two major benchmarks that test different aspects of the system:
  • GitTaskBench: Repository-level benchmark for real-world coding tasks
  • MLE-Bench: Machine learning engineering benchmark from OpenAI
These benchmarks validate RepoMaster’s ability to understand, explore, and execute complex tasks using GitHub repositories.

Key Performance Highlights

RepoMaster achieves state-of-the-art performance across multiple metrics:

GitTaskBench

  • 75.92% Execution Rate
  • 62.96% Task Pass Rate
  • 154k Average Token Usage

MLE-Bench

  • 95.45% Valid Submissions
  • 27.27% Medal Rate
  • 22.73% Gold Medals

Why These Benchmarks Matter

GitTaskBench

GitTaskBench is a comprehensive repository-level benchmark designed to evaluate AI agents on real-world coding tasks. It tests an agent’s ability to:
  • Discover relevant repositories from GitHub
  • Understand complex codebases with minimal documentation
  • Execute tasks that require repository-level understanding
  • Adapt to diverse programming languages and frameworks
RepoMaster’s performance on GitTaskBench demonstrates its practical applicability to real-world software engineering tasks.

MLE-Bench

MLE-Bench, created by OpenAI, evaluates machine learning engineering capabilities through Kaggle-style competitions. It assesses:
  • Data processing and feature engineering skills
  • Model development and experimentation
  • Submission generation and validation
  • End-to-end ML pipeline execution
Our strong performance on MLE-Bench shows RepoMaster’s ability to handle complex ML workflows autonomously.

Evaluation Methodology

Our benchmark evaluation follows rigorous standards:
  1. Reproducibility: All experiments can be reproduced using configurations in configs/
  2. Fair Comparison: We compare against published baselines using identical setups
  3. Comprehensive Metrics: We report multiple metrics to provide a complete picture
  4. Transparency: Full results and analysis are available in our research paper
The benchmarking code and configurations are available in the RepoMaster repository. You can run the benchmarks yourself using:

```bash
python -m src.core.git_task --config configs/benchmark_config.yaml
```

Comparison with Baselines

RepoMaster significantly outperforms existing frameworks:
| Framework | GitTaskBench Execution Rate | GitTaskBench Pass Rate |
| --- | --- | --- |
| RepoMaster | 75.92% | 62.96% |
| OpenHands | - | 24.07% |
| SWE-Agent | 44.44% | - |
Token Efficiency: RepoMaster achieves a 95% reduction in token usage compared to existing frameworks, averaging only 154k tokens per task. This dramatic improvement in efficiency makes RepoMaster practical for real-world deployment.
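To make the efficiency claim concrete, the implied baseline can be back-calculated: if 154k tokens represents a 95% reduction, the comparison frameworks average roughly 3M tokens per task. A minimal sketch of that arithmetic, assuming the 95% figure is measured relative to the baseline average:

```python
# Back-calculate the implied baseline token usage from the reported numbers.
# Assumption (not stated in the doc): the 95% reduction is relative to the
# baseline frameworks' average token usage per task.
repomaster_tokens = 154_000
reduction = 0.95

# reduction = 1 - (repomaster / baseline)  =>  baseline = repomaster / (1 - reduction)
baseline_tokens = repomaster_tokens / (1 - reduction)

print(f"Implied baseline average: {baseline_tokens / 1e6:.2f}M tokens per task")
# → Implied baseline average: 3.08M tokens per task
```

This is only an order-of-magnitude sanity check; the actual baseline figures are reported in the research paper.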

Research Publication

Our work has been accepted to NeurIPS 2025 as a Spotlight paper (top 3.2% of submissions):

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories

Authors: Huacan Wang, Ziyi Ni, Shuo Zhang, Lu Shuo, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Ronghao Chen, Xin Li, Daxin Jiang, Yuntao Du, Pin Lyu
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2505.21577

Next Steps

GitTaskBench Details

Explore detailed GitTaskBench results and analysis

MLE-Bench Results

View MLE-Bench evaluation and methodology

Performance Analysis

Deep dive into performance metrics and comparisons

Run Benchmarks

Learn how to run benchmarks yourself
