What is GitTaskBench?
GitTaskBench is a comprehensive repository-level benchmark and tooling suite designed to evaluate AI agents on real-world coding tasks. Developed by the QuantaAlpha team, it provides:
- Real-world tasks that require repository-level understanding
- Diverse domains including data processing, ML, web development, and more
- Standardized evaluation with reproducible metrics
- Task complexity ranging from simple scripts to multi-file projects
GitTaskBench was open-sourced on August 26, 2025 as part of the QuantaAlpha ecosystem, alongside RepoMaster and SE-Agent.
RepoMaster Results
RepoMaster achieves state-of-the-art performance on GitTaskBench:
- Execution Rate: 75.92% (tasks executed successfully without errors)
- Task Pass Rate: 62.96% (tasks completed with correct outputs)
- Token Usage: 154k average tokens per task (95% reduction)
Performance Comparison
RepoMaster significantly outperforms existing frameworks on GitTaskBench:
Execution Rate
| Framework | Execution Rate | RepoMaster's Relative Improvement |
|---|---|---|
| RepoMaster | 75.92% | (reference) |
| SWE-Agent | 44.44% | +70.9% |
| OpenHands | - | - |
Task Pass Rate
| Framework | Pass Rate | RepoMaster's Relative Improvement |
|---|---|---|
| RepoMaster | 62.96% | (reference) |
| OpenHands | 24.07% | +161.5% |
| SWE-Agent | - | - |
Key Insight: RepoMaster achieves a 70.9% improvement in execution rate over SWE-Agent and a 161.5% improvement in pass rate over OpenHands, demonstrating superior ability to understand and execute repository-level tasks.
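The improvement figures are relative gains, not percentage-point differences. The following is a minimal sketch of that arithmetic using the rounded rates from the tables; the function name is illustrative and not part of any RepoMaster or GitTaskBench API. Because the inputs are already rounded to two decimals, the computed values land within about 0.1 of the reported 70.9% and 161.5%.

```python
def relative_improvement(ours: float, theirs: float) -> float:
    """Percentage by which `ours` exceeds `theirs`, relative to `theirs`."""
    return (ours - theirs) / theirs * 100.0


# Rates as reported in the tables (percent of tasks).
exec_gain = relative_improvement(75.92, 44.44)  # RepoMaster vs. SWE-Agent
pass_gain = relative_improvement(62.96, 24.07)  # RepoMaster vs. OpenHands

print(f"Execution rate gain: +{exec_gain:.1f}%")
print(f"Pass rate gain: +{pass_gain:.1f}%")
```

For comparison, the percentage-point differences are 31.48 points for execution rate and 38.89 points for pass rate.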
Token Efficiency
One of RepoMaster’s most significant advantages is its token efficiency, which comes from four mechanisms:
- Hierarchical Analysis: Uses hybrid structural modeling (hierarchical code tree, HCT; function call graph, FCG; module dependency graph, MDG) to identify core components
- Selective Context: Only loads relevant code sections instead of entire repositories
- Smart Navigation: Navigates codebases at file/class/function granularity
- Adaptive Exploration: Adjusts exploration depth based on task requirements
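To make the idea of selective context concrete, here is a small sketch built on Python's standard `ast` module. It is not RepoMaster's implementation: it only illustrates a file/class/function outline (a simplified stand-in for a hierarchical code tree) and loading the source of a single symbol instead of the whole file. The names `build_outline` and `load_symbol` and the sample source are assumptions made for this example, and call graphs and module dependency graphs are left out.

```python
import ast

SOURCE = '''
import math

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return math.pi * self.r ** 2

def describe(shape):
    return f"area={shape.area():.2f}"
'''


def build_outline(source):
    """Return a nested outline: [(kind, name, start_line, end_line, children), ...]."""
    def visit(node):
        items = []
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "class" if isinstance(child, ast.ClassDef) else "function"
                items.append((kind, child.name, child.lineno, child.end_lineno, visit(child)))
        return items
    return visit(ast.parse(source))


def load_symbol(source, outline, qualified_name):
    """Return only the source lines of one symbol, e.g. 'Circle.area'."""
    items, match = outline, None
    for part in qualified_name.split("."):
        match = next((item for item in items if item[1] == part), None)
        if match is None:
            raise KeyError(qualified_name)
        items = match[4]  # descend into the children of the matched symbol
    _, _, start, end, _ = match
    return "\n".join(source.splitlines()[start - 1:end])


outline = build_outline(SOURCE)
snippet = load_symbol(SOURCE, outline, "Circle.area")
print(snippet)
```

An agent working this way would first read the compact outline, then request only the symbols that look relevant, which keeps the prompt far smaller than pasting entire files.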
What GitTaskBench Tests
GitTaskBench evaluates multiple dimensions of repository understanding:
1. Repository Discovery
- Finding relevant repositories for task requirements
- Evaluating repository quality and suitability
- Multi-round query optimization
2. Code Understanding
- Reading and comprehending README files
- Understanding code structure and dependencies
- Identifying core functionality and entry points
3. Task Execution
- Setting up execution environments
- Installing dependencies correctly
- Running code with proper configurations
- Handling edge cases and errors
4. Output Generation
- Producing correct output formats
- Saving results to specified locations
- Validating output correctness
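The two headline metrics map onto the last two dimensions: execution rate counts tasks that ran without errors, and pass rate counts tasks whose outputs were correct. The sketch below shows how such rates could be aggregated from per-task records. The `TaskOutcome` fields and the task IDs are hypothetical and do not reflect GitTaskBench's actual result schema.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    task_id: str    # hypothetical identifier
    executed: bool  # the run finished without errors
    passed: bool    # the produced output was judged correct


def summarize(outcomes):
    """Return execution and pass rates as percentages of all tasks."""
    total = len(outcomes)
    if total == 0:
        return {"execution_rate": 0.0, "pass_rate": 0.0}
    executed = sum(1 for o in outcomes if o.executed)
    # A task can only pass if it executed in the first place.
    passed = sum(1 for o in outcomes if o.executed and o.passed)
    return {
        "execution_rate": 100.0 * executed / total,
        "pass_rate": 100.0 * passed / total,
    }


outcomes = [
    TaskOutcome("task_a", executed=True, passed=True),
    TaskOutcome("task_b", executed=True, passed=False),
    TaskOutcome("task_c", executed=False, passed=False),
    TaskOutcome("task_d", executed=True, passed=True),
]
summary = summarize(outcomes)
print(summary)
```

Both rates use the full task count as the denominator, so the pass rate can never exceed the execution rate.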
Example Tasks
GitTaskBench includes diverse real-world tasks:
PDF Text Extraction
Task: Extract text content from PDF files using repository tools
Requirements:
- Find and use appropriate PDF parsing repositories
- Handle multi-column layouts and formatting
- Save extracted text in specified format
Input: PDFPlumber_01/input/PDFPlumber_01_input.pdf
Expected Output: Structured text file with preserved formatting
Neural Style Transfer
Task: Apply artistic style transfer to images
Requirements:
- Locate neural style transfer repositories
- Configure model parameters
- Process content and style images
- Generate output with transferred style
Web Scraping
Task: Extract structured data from websites
Requirements:
- Find suitable web scraping tools
- Handle dynamic content and pagination
- Parse and structure extracted data
- Save results in CSV/JSON format
Image Classification
Task: Train an image classifier on the provided dataset
Requirements:
- Use appropriate ML frameworks
- Implement data loading and preprocessing
- Configure and train model
- Generate predictions on test set
Running GitTaskBench
You can evaluate RepoMaster on GitTaskBench yourself:
Step 1: Prepare Configuration
Create a task configuration file (YAML format).
Step 2: Run Evaluation
Step 3: View Results
Results are saved in the specified output directory.
The src.core.git_task module provides the TaskManager, PathManager, DataProcessor, and AgentRunner classes for benchmark execution. See the API Reference for detailed documentation.
Key Findings
Our GitTaskBench evaluation reveals several important insights:
1. Repository Understanding is Critical
Tasks requiring deep repository understanding show the largest performance gaps between RepoMaster and baselines. RepoMaster’s hierarchical analysis and selective context loading enable superior comprehension.
2. Efficiency Enables Scale
The 95% reduction in token usage makes RepoMaster practical for real-world deployment:
- Lower costs: Dramatically reduced API expenses
- Faster execution: Less data to process means faster responses
- Better context: More room for task-specific information
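As a rough illustration of what the reduction means for cost, the sketch below derives the comparison figure implied by the two reported numbers (154k average tokens per task and a 95% reduction). The source does not state which baseline the 95% refers to, so the implied value is only the arithmetic consequence of those two numbers. The price per million tokens is a made-up placeholder, not a quote for any real API.

```python
AVG_TOKENS_PER_TASK = 154_000  # reported average tokens per task
REDUCTION_PCT = 95             # reported reduction in token usage

# Hypothetical price, chosen only to make the arithmetic concrete.
USD_PER_MILLION_TOKENS = 3.0


def implied_baseline_tokens(tokens, reduction_pct):
    """Token count before the reduction, implied by the reduced count."""
    return tokens * 100 / (100 - reduction_pct)


def cost_usd(tokens, usd_per_million_tokens):
    """Cost of processing `tokens` at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


baseline_tokens = implied_baseline_tokens(AVG_TOKENS_PER_TASK, REDUCTION_PCT)
reduced_cost = cost_usd(AVG_TOKENS_PER_TASK, USD_PER_MILLION_TOKENS)
baseline_cost = cost_usd(baseline_tokens, USD_PER_MILLION_TOKENS)

print(f"Implied baseline: {baseline_tokens:,.0f} tokens per task")
print(f"Cost per task: ${reduced_cost:.2f} vs. ${baseline_cost:.2f}")
```

Whatever the actual price, the cost ratio is fixed by the reduction: a 95% cut in tokens means roughly one twentieth of the per-task spend under flat per-token pricing.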
3. Multi-Step Reasoning
Complex tasks requiring multi-step reasoning (e.g., find repo → understand code → execute → validate) benefit most from RepoMaster’s systematic approach.
4. Domain Versatility
Strong performance across diverse domains (ML, data processing, web, CV) demonstrates RepoMaster’s general-purpose capabilities.
Limitations and Future Work
While RepoMaster achieves strong results, some challenges remain:
- Complex dependencies: Tasks with intricate dependency chains may require manual intervention
- Non-Python repositories: Performance is optimized for Python; other languages show lower success rates
- Undocumented code: Repositories with poor documentation are more challenging
- Environment setup: Some tasks have complex environment requirements
Citation
If you use GitTaskBench in your research, please cite both RepoMaster and GitTaskBench.
Learn More
- GitTaskBench Repository: Explore the GitTaskBench benchmark suite
- MLE-Bench Results: View RepoMaster’s ML engineering performance
- Performance Details: Deep dive into performance analysis
- Research Paper: Read the full NeurIPS 2025 paper