
What is GitTaskBench?

GitTaskBench is a comprehensive repository-level benchmark and tooling suite designed to evaluate AI agents on real-world coding tasks. Developed by the QuantaAlpha team, it provides:
  • Real-world tasks that require repository-level understanding
  • Diverse domains including data processing, ML, web development, and more
  • Standardized evaluation with reproducible metrics
  • Task complexity ranging from simple scripts to multi-file projects
GitTaskBench was open-sourced on August 26, 2025 as part of the QuantaAlpha ecosystem, alongside RepoMaster and SE-Agent.

RepoMaster Results

RepoMaster achieves state-of-the-art performance on GitTaskBench:

  • Execution Rate: 75.92% (tasks executed without errors)
  • Task Pass Rate: 62.96% (tasks completed with correct outputs)
  • Token Usage: 154k average tokens per task (a 95% reduction)

Performance Comparison

RepoMaster significantly outperforms existing frameworks on GitTaskBench:

Execution Rate

| Framework  | Execution Rate | RepoMaster's Improvement |
|------------|----------------|--------------------------|
| RepoMaster | 75.92%         | Baseline                 |
| SWE-Agent  | 44.44%         | +70.9%                   |
| OpenHands  | --             | --                       |

Task Pass Rate

| Framework  | Pass Rate | RepoMaster's Improvement |
|------------|-----------|--------------------------|
| RepoMaster | 62.96%    | Baseline                 |
| OpenHands  | 24.07%    | +161.5%                  |
| SWE-Agent  | --        | --                       |
Key Insight: RepoMaster achieves a 70.9% improvement in execution rate over SWE-Agent and a 161.5% improvement in pass rate over OpenHands, demonstrating superior ability to understand and execute repository-level tasks.

Token Efficiency

One of RepoMaster’s most significant advantages is its exceptional token efficiency:
RepoMaster:     154,000 tokens/task
Baselines:    3,080,000 tokens/task (estimated)

Reduction:    95% fewer tokens
Cost savings: ~95% lower API costs
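The 95% figure follows directly from the two per-task numbers above:

```python
# Sanity check of the claimed token reduction: 154k per task versus the
# ~3.08M per-task baseline estimate quoted above.
repomaster_tokens = 154_000
baseline_tokens = 3_080_000
reduction = 1 - repomaster_tokens / baseline_tokens
print(f"Reduction: {reduction:.0%}")  # Reduction: 95%
```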
This efficiency comes from RepoMaster’s intelligent repository exploration strategy:
  1. Hierarchical Analysis: Uses hybrid structural modeling (Hierarchical Code Tree, Function Call Graph, and Module Dependency Graph) to identify core components
  2. Selective Context: Only loads relevant code sections instead of entire repositories
  3. Smart Navigation: Navigates codebases at file/class/function granularity
  4. Adaptive Exploration: Adjusts exploration depth based on task requirements
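As a toy illustration of points 2 and 3 (not RepoMaster's actual ranking logic), selective context loading might score files with cheap heuristics and read only the top candidates instead of the whole repository:

```python
import os

def select_core_files(repo_root, max_files=5):
    """Toy sketch of selective context loading: rank Python files by
    simple structural heuristics and return only the top candidates.
    The heuristics here are illustrative stand-ins, not RepoMaster's."""
    scored = []
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            score = 0
            if name in ("main.py", "cli.py", "__main__.py"):
                score += 10  # likely entry points
            # prefer shallow, top-level modules over deeply nested ones
            score -= os.path.relpath(path, repo_root).count(os.sep)
            # give some weight to substantial files (capped)
            score += min(os.path.getsize(path) // 1024, 5)
            scored.append((score, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:max_files]]
```

An agent would then load only these files into context, leaving room for task-specific information.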

What GitTaskBench Tests

GitTaskBench evaluates multiple dimensions of repository understanding:

1. Repository Discovery

  • Finding relevant repositories for task requirements
  • Evaluating repository quality and suitability
  • Multi-round query optimization

2. Code Understanding

  • Reading and comprehending README files
  • Understanding code structure and dependencies
  • Identifying core functionality and entry points

3. Task Execution

  • Setting up execution environments
  • Installing dependencies correctly
  • Running code with proper configurations
  • Handling edge cases and errors

4. Output Generation

  • Producing correct output formats
  • Saving results to specified locations
  • Validating output correctness

Example Tasks

GitTaskBench includes diverse real-world tasks:
Task: Extract text content from PDF files using repository tools
Requirements:
  • Find and use appropriate PDF parsing repositories
  • Handle multi-column layouts and formatting
  • Save extracted text in the specified format
Example Input: PDFPlumber_01/input/PDFPlumber_01_input.pdf
Expected Output: Structured text file with preserved formatting
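One possible shape of a solution to this task, assuming the agent selects the pdfplumber library (which the task ID suggests); `extract_pdf_text` is a hypothetical helper name, not part of the benchmark:

```python
def extract_pdf_text(pdf_path, out_path):
    """Extract text page by page and save it as a plain-text file.
    A sketch using pdfplumber (third-party: pip install pdfplumber),
    not the benchmark's reference solution."""
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for empty or image-only pages
        pages = [page.extract_text() or "" for page in pdf.pages]
    text = "\n\n".join(pages)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Multi-column layouts may need finer-grained handling (e.g. pdfplumber's word-level extraction with custom column grouping) than this page-level call.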
Task: Apply artistic style transfer to images
Requirements:
  • Locate neural style transfer repositories
  • Configure model parameters
  • Process content and style images
  • Generate output with transferred style
Example: Transform a portrait into Van Gogh's painting style
Task: Extract structured data from websites
Requirements:
  • Find suitable web scraping tools
  • Handle dynamic content and pagination
  • Parse and structure extracted data
  • Save results in CSV/JSON format
Example: Scrape product prices and specifications
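A stdlib-only illustration of this task's shape, assuming a hypothetical page layout in which product fields carry `name` and `price` CSS classes (real pages with dynamic content or pagination would need heavier tooling, such as a headless browser):

```python
import csv
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect text of elements whose class is 'name' or 'price'.
    The class names model a hypothetical page, purely for illustration."""
    def __init__(self):
        super().__init__()
        self._field = None
        self._current = {}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:  # one complete product row
                self.rows.append(self._current)
                self._current = {}

def scrape_to_csv(html, out_path):
    """Parse product rows out of an HTML string and save them as CSV."""
    parser = ProductParser()
    parser.feed(html)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(parser.rows)
    return parser.rows
```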
Task: Train an image classifier on a provided dataset
Requirements:
  • Use appropriate ML frameworks
  • Implement data loading and preprocessing
  • Configure and train model
  • Generate predictions on test set
Example: CIFAR-10 classification with transfer learning

Running GitTaskBench

You can evaluate RepoMaster on GitTaskBench yourself:

Step 1: Prepare Configuration

Create a task configuration file (YAML format):
repo:
  type: github
  url: https://github.com/example/repository

task_description: |
  Your task description here

task_prompt: |
  ### Task Description
  {task_description}
  
  #### Repository Path: 
  {repo_path}
  
  #### Input Data:
  {input_data}
  
  #### Output Directory:
  {output_dir_path}

input_data:
  - path: /path/to/input/data
    description: Description of input data

parameters:
  max_turns: 20
  use_venv: true
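The four placeholders in `task_prompt` are presumably substituted by the runner before the agent sees the prompt. A quick sketch of that substitution using plain `str.format` (the example values are ours, not from the benchmark; the actual logic lives in `src.core.git_task`):

```python
# The template mirrors the task_prompt field of the YAML config above.
task_prompt = (
    "### Task Description\n{task_description}\n\n"
    "#### Repository Path: \n{repo_path}\n\n"
    "#### Input Data:\n{input_data}\n\n"
    "#### Output Directory:\n{output_dir_path}\n"
)

filled = task_prompt.format(
    task_description="Your task description here",
    repo_path="./workspace/repository",          # illustrative path
    input_data="/path/to/input/data",
    output_dir_path="./workspace/output_results",  # illustrative path
)
print(filled)
```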

Step 2: Run Evaluation

# Run single task
python -m src.core.git_task --config configs/task_config.yaml

# Run with custom settings
python -m src.core.git_task \
  --config configs/task_config.yaml \
  --retry 2 \
  --root_path ./results

Step 3: View Results

Results are saved in the specified output directory:
results/
├── gitbench_MMDD_HHMM/
│   ├── task_id/
│   │   ├── workspace/
│   │   │   ├── repository/
│   │   │   ├── input_dataset/
│   │   │   └── output_results/
│   │   └── task_info.json
The src.core.git_task module provides the TaskManager, PathManager, DataProcessor, and AgentRunner classes for benchmark execution. See the API Reference for detailed documentation.
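To aggregate a run programmatically, one could walk the results tree shown above and load each `task_info.json`. Its exact schema isn't documented here, so this sketch loads each file as-is, keyed by task ID:

```python
import glob
import json
import os

def collect_results(root="results"):
    """Gather task_info.json files from a results run, keyed by the
    task_id directory name. Loads whatever each file contains, since
    the task_info.json schema is not specified here."""
    results = {}
    pattern = os.path.join(root, "gitbench_*", "*", "task_info.json")
    for path in glob.glob(pattern):
        task_id = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            results[task_id] = json.load(f)
    return results
```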

Key Findings

Our GitTaskBench evaluation reveals several important insights:

1. Repository Understanding is Critical

Tasks requiring deep repository understanding show the largest performance gaps between RepoMaster and baselines. RepoMaster’s hierarchical analysis and selective context loading enable superior comprehension.

2. Efficiency Enables Scale

The 95% reduction in token usage makes RepoMaster practical for real-world deployment:
  • Lower costs: Dramatically reduced API expenses
  • Faster execution: Less data to process means faster responses
  • Better context: More room for task-specific information

3. Multi-Step Reasoning

Complex tasks requiring multi-step reasoning (e.g., find repo → understand code → execute → validate) benefit most from RepoMaster’s systematic approach.

4. Domain Versatility

Strong performance across diverse domains (ML, data processing, web, CV) demonstrates RepoMaster’s general-purpose capabilities.

Limitations and Future Work

While RepoMaster achieves strong results, some challenges remain:
  • Complex dependencies: Tasks with intricate dependency chains may require manual intervention
  • Non-Python repositories: Performance is optimized for Python; other languages show lower success rates
  • Undocumented code: Repositories with poor documentation are more challenging
  • Environment setup: Some tasks have complex environment requirements
Future improvements will address these limitations through enhanced dependency resolution, multi-language support, and better environment management.

Citation

If you use GitTaskBench in your research, please cite both RepoMaster and GitTaskBench:
@article{wang2025repomaster,
  title={RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving},
  author={Huacan Wang and Ziyi Ni and Shuo Zhang and Shuo Lu and Sen Hu and Ziyang He and Chen Hu and Jiaye Lin and Yifu Guo and Ronghao Chen and Xin Li and Daxin Jiang and Yuntao Du and Pin Lyu},
  journal={arXiv preprint arXiv:2505.21577},
  year={2025}
}

Learn More

GitTaskBench Repository

Explore the GitTaskBench benchmark suite

MLE-Bench Results

View RepoMaster’s ML engineering performance

Performance Details

Deep dive into performance analysis

Research Paper

Read the full NeurIPS 2025 paper
