
What is GitTaskBench?

GitTaskBench is a comprehensive repository-level benchmark and tooling suite designed to evaluate AI agents on real-world coding tasks. Developed by the QuantaAlpha team, it provides:
  • Real-world tasks that require repository-level understanding
  • Diverse domains including data processing, ML, web development, and more
  • Standardized evaluation with reproducible metrics
  • Task complexity ranging from simple scripts to multi-file projects
GitTaskBench was open-sourced on August 26, 2025 as part of the QuantaAlpha ecosystem, alongside RepoMaster and SE-Agent.

RepoMaster Results

RepoMaster achieves state-of-the-art performance on GitTaskBench:

  • Execution Rate: 75.92% (tasks executed without errors)
  • Task Pass Rate: 62.96% (tasks completed with correct outputs)
  • Token Usage: 154k average tokens per task (a 95% reduction)

Performance Comparison

RepoMaster significantly outperforms existing frameworks on GitTaskBench:

Execution Rate

| Framework  | Execution Rate | RepoMaster's Improvement |
|------------|----------------|--------------------------|
| RepoMaster | 75.92%         | Baseline                 |
| SWE-Agent  | 44.44%         | +70.9%                   |
| OpenHands  | --             | --                       |

Task Pass Rate

| Framework  | Pass Rate | RepoMaster's Improvement |
|------------|-----------|--------------------------|
| RepoMaster | 62.96%    | Baseline                 |
| OpenHands  | 24.07%    | +161.5%                  |
| SWE-Agent  | --        | --                       |
Key Insight: RepoMaster achieves a 70.9% improvement in execution rate over SWE-Agent and a 161.5% improvement in pass rate over OpenHands, demonstrating superior ability to understand and execute repository-level tasks.

Token Efficiency

One of RepoMaster’s most significant advantages is its exceptional token efficiency:
RepoMaster:     154,000 tokens/task
Baselines:    3,080,000 tokens/task (estimated)

Reduction:    95% fewer tokens
Cost savings: ~95% lower API costs
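The 95% figure follows directly from the two per-task numbers above:

```python
# Sanity check of the claimed token reduction: 154k per task versus the
# ~3.08M per-task baseline estimate quoted above.
repomaster_tokens = 154_000
baseline_tokens = 3_080_000
reduction = 1 - repomaster_tokens / baseline_tokens
print(f"Reduction: {reduction:.0%}")  # Reduction: 95%
```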
This efficiency comes from RepoMaster’s intelligent repository exploration strategy:
  1. Hierarchical Analysis: Uses hybrid structural modeling (Hierarchical Code Tree, Function Call Graph, and Module Dependency Graph) to identify core components
  2. Selective Context: Only loads relevant code sections instead of entire repositories
  3. Smart Navigation: Navigates codebases at file/class/function granularity
  4. Adaptive Exploration: Adjusts exploration depth based on task requirements
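As a toy illustration of points 2 and 3 (not RepoMaster's actual ranking logic), selective context loading might score files with cheap heuristics and read only the top candidates instead of the whole repository:

```python
import os

def select_core_files(repo_root, max_files=5):
    """Toy sketch of selective context loading: rank Python files by
    simple structural heuristics and return only the top candidates.
    The heuristics here are illustrative stand-ins, not RepoMaster's."""
    scored = []
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            score = 0
            if name in ("main.py", "cli.py", "__main__.py"):
                score += 10  # likely entry points
            # prefer shallow, top-level modules over deeply nested ones
            score -= os.path.relpath(path, repo_root).count(os.sep)
            # give some weight to substantial files (capped)
            score += min(os.path.getsize(path) // 1024, 5)
            scored.append((score, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:max_files]]
```

An agent would then load only these files into context, leaving room for task-specific information.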

What GitTaskBench Tests

GitTaskBench evaluates multiple dimensions of repository understanding:

1. Repository Discovery

  • Finding relevant repositories for task requirements
  • Evaluating repository quality and suitability
  • Multi-round query optimization

2. Code Understanding

  • Reading and comprehending README files
  • Understanding code structure and dependencies
  • Identifying core functionality and entry points

3. Task Execution

  • Setting up execution environments
  • Installing dependencies correctly
  • Running code with proper configurations
  • Handling edge cases and errors

4. Output Generation

  • Producing correct output formats
  • Saving results to specified locations
  • Validating output correctness

Example Tasks

GitTaskBench includes diverse real-world tasks:
Task: Extract text content from PDF files using repository tools
Requirements:
  • Find and use appropriate PDF parsing repositories
  • Handle multi-column layouts and formatting
  • Save extracted text in the specified format
Example Input: PDFPlumber_01/input/PDFPlumber_01_input.pdf
Expected Output: Structured text file with preserved formatting
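One possible shape of a solution to this task, assuming the agent selects the pdfplumber library (which the task ID suggests); `extract_pdf_text` is a hypothetical helper name, not part of the benchmark:

```python
def extract_pdf_text(pdf_path, out_path):
    """Extract text page by page and save it as a plain-text file.
    A sketch using pdfplumber (third-party: pip install pdfplumber),
    not the benchmark's reference solution."""
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for empty or image-only pages
        pages = [page.extract_text() or "" for page in pdf.pages]
    text = "\n\n".join(pages)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Multi-column layouts may need finer-grained handling (e.g. pdfplumber's word-level extraction with custom column grouping) than this page-level call.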
Task: Apply artistic style transfer to images
Requirements:
  • Locate neural style transfer repositories
  • Configure model parameters
  • Process content and style images
  • Generate output with transferred style
Example: Transform a portrait into Van Gogh's painting style
Task: Extract structured data from websites
Requirements:
  • Find suitable web scraping tools
  • Handle dynamic content and pagination
  • Parse and structure extracted data
  • Save results in CSV/JSON format
Example: Scrape product prices and specifications
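A stdlib-only illustration of this task's shape, assuming a hypothetical page layout in which product fields carry `name` and `price` CSS classes (real pages with dynamic content or pagination would need heavier tooling, such as a headless browser):

```python
import csv
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect text of elements whose class is 'name' or 'price'.
    The class names model a hypothetical page, purely for illustration."""
    def __init__(self):
        super().__init__()
        self._field = None
        self._current = {}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:  # one complete product row
                self.rows.append(self._current)
                self._current = {}

def scrape_to_csv(html, out_path):
    """Parse product rows out of an HTML string and save them as CSV."""
    parser = ProductParser()
    parser.feed(html)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(parser.rows)
    return parser.rows
```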
Task: Train an image classifier on a provided dataset
Requirements:
  • Use appropriate ML frameworks
  • Implement data loading and preprocessing
  • Configure and train model
  • Generate predictions on test set
Example: CIFAR-10 classification with transfer learning

Running GitTaskBench

You can evaluate RepoMaster on GitTaskBench yourself:

Step 1: Prepare Configuration

Create a task configuration file (YAML format):
repo:
  type: github
  url: https://github.com/example/repository

task_description: |
  Your task description here

task_prompt: |
  ### Task Description
  {task_description}
  
  #### Repository Path: 
  {repo_path}
  
  #### Input Data:
  {input_data}
  
  #### Output Directory:
  {output_dir_path}

input_data:
  - path: /path/to/input/data
    description: Description of input data

parameters:
  max_turns: 20
  use_venv: true
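The four placeholders in `task_prompt` are presumably substituted by the runner before the agent sees the prompt. A quick sketch of that substitution using plain `str.format` (the example values are ours, not from the benchmark; the actual logic lives in `src.core.git_task`):

```python
# The template mirrors the task_prompt field of the YAML config above.
task_prompt = (
    "### Task Description\n{task_description}\n\n"
    "#### Repository Path: \n{repo_path}\n\n"
    "#### Input Data:\n{input_data}\n\n"
    "#### Output Directory:\n{output_dir_path}\n"
)

filled = task_prompt.format(
    task_description="Your task description here",
    repo_path="./workspace/repository",          # illustrative path
    input_data="/path/to/input/data",
    output_dir_path="./workspace/output_results",  # illustrative path
)
print(filled)
```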

Step 2: Run Evaluation

# Run single task
python -m src.core.git_task --config configs/task_config.yaml

# Run with custom settings
python -m src.core.git_task \
  --config configs/task_config.yaml \
  --retry 2 \
  --root_path ./results

Step 3: View Results

Results are saved in the specified output directory:
results/
├── gitbench_MMDD_HHMM/
│   ├── task_id/
│   │   ├── workspace/
│   │   │   ├── repository/
│   │   │   ├── input_dataset/
│   │   │   └── output_results/
│   │   └── task_info.json
The src.core.git_task module provides the TaskManager, PathManager, DataProcessor, and AgentRunner classes for benchmark execution. See the API Reference for detailed documentation.
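To aggregate a run programmatically, one could walk the results tree shown above and load each `task_info.json`. Its exact schema isn't documented here, so this sketch loads each file as-is, keyed by task ID:

```python
import glob
import json
import os

def collect_results(root="results"):
    """Gather task_info.json files from a results run, keyed by the
    task_id directory name. Loads whatever each file contains, since
    the task_info.json schema is not specified here."""
    results = {}
    pattern = os.path.join(root, "gitbench_*", "*", "task_info.json")
    for path in glob.glob(pattern):
        task_id = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            results[task_id] = json.load(f)
    return results
```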

Key Findings

Our GitTaskBench evaluation reveals several important insights:

1. Repository Understanding is Critical

Tasks requiring deep repository understanding show the largest performance gaps between RepoMaster and baselines. RepoMaster’s hierarchical analysis and selective context loading enable superior comprehension.

2. Efficiency Enables Scale

The 95% reduction in token usage makes RepoMaster practical for real-world deployment:
  • Lower costs: Dramatically reduced API expenses
  • Faster execution: Less data to process means faster responses
  • Better context: More room for task-specific information

3. Multi-Step Reasoning

Complex tasks requiring multi-step reasoning (e.g., find repo → understand code → execute → validate) benefit most from RepoMaster’s systematic approach.

4. Domain Versatility

Strong performance across diverse domains (ML, data processing, web, CV) demonstrates RepoMaster’s general-purpose capabilities.

Limitations and Future Work

While RepoMaster achieves strong results, some challenges remain:
  • Complex dependencies: Tasks with intricate dependency chains may require manual intervention
  • Non-Python repositories: Performance is optimized for Python; other languages show lower success rates
  • Undocumented code: Repositories with poor documentation are more challenging
  • Environment setup: Some tasks have complex environment requirements
Future improvements will address these limitations through enhanced dependency resolution, multi-language support, and better environment management.

Citation

If you use GitTaskBench in your research, please cite both RepoMaster and GitTaskBench:
@article{wang2025repomaster,
  title={RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving},
  author={Huacan Wang and Ziyi Ni and Shuo Zhang and Shuo Lu and Sen Hu and Ziyang He and Chen Hu and Jiaye Lin and Yifu Guo and Ronghao Chen and Xin Li and Daxin Jiang and Yuntao Du and Pin Lyu},
  journal={arXiv preprint arXiv:2505.21577},
  year={2025}
}

Learn More

GitTaskBench Repository

Explore the GitTaskBench benchmark suite

MLE-Bench Results

View RepoMaster’s ML engineering performance

Performance Details

Deep dive into performance analysis

Research Paper

Read the full NeurIPS 2025 paper
